July 7, 2010

Alfresco, NOSQL, and the Future of ECM

Alfresco wants to be a best-in-class repository for you to build your content-centric applications on top of. Interest in NOSQL repositories seems to be growing, with many large well-known sites choosing non-relational back-ends. Is Alfresco (and, more generally, nearly every ECM and WCM vendor) on a collision course with NOSQL?

First, let’s look at what Alfresco’s been up to lately. Over the last year or so, Alfresco has been shifting to a “we’re for developers” strategy in several ways:

Repositioning their Web Content Management offering not as a non-technical end-user tool, but as a tool for web application developers
Backing off of their mission to squash Microsoft SharePoint, positioning Alfresco Share instead as “good enough” collaboration. (Remember John Newton’s slide showing Microsoft as the Death Star and Alfresco as the Millennium Falcon? I think Han Solo has decided to take the fight elsewhere.)
Making Web Scripts, Surf, and Web Studio part of the Spring Framework.
Investing heavily in the Content Management Interoperability Services (CMIS) standard. The investment is far-reaching–Alfresco is an active participant in the OASIS specification itself, has historically been first-to-market with their CMIS implementation, and has multiple participants in CMIS-related open source projects such as Apache Chemistry.

They’ve also been making changes to the core product to make it more scalable (“Internet-scalable” is the stated goal). At a high level, they are disaggregating major Alfresco sub-systems so they can be scaled independently and in some cases removing bottlenecks present in the core infrastructure. Here are a few examples. Some of these are in progress and others are still on the roadmap:

Migrating away from Hibernate, which Alfresco engineers say is currently a limiting factor
Switching from “Lucene for everything” to “Lucene for full-text and SQL for metadata search”
Making Lucene a separate search server process (presumably clusterable)
Making OpenOffice, which is used for document transformations, clusterable
Hiring Tom Baeyens (JBoss jBPM founder) and starting the Activiti BPMN project (one of their goals is “cloud scalability from the ground, up”)

So for Alfresco it is all about being an internet-scalable repository that is standards-compliant and has a rich toolset that makes it easy for you to use Alfresco as the back-end of your content-centric applications. Hold that thought while we turn our attention to NOSQL for a moment. Then, like a great rug, I’ll tie the whole room together.

NOSQL Stores

A NOSQL (“Not Only SQL”) store is a repository that does not use a relational database for persistence. There are many different flavors (document-oriented, key-value, tabular), and a number of different implementations. I’ll refer mostly to MongoDB and CouchDB in this post, which are two examples of document-oriented stores. In general, NOSQL stores are:

Schema-less. Need to add an “author” field to your “article”? Just add it–it’s as easy as setting a property value. The repository doesn’t care that the other articles in your repository don’t have an author field. The repository doesn’t know what an “article” is, for that matter.
Eventually consistent instead of guaranteed consistent. At some point, all replicas in a given cluster will be fully up-to-date. If a replica can’t get up-to-date, it will remove itself from the cluster.
Easily replicate-able. It’s very easy to instantiate new server nodes and replicate data between them and, in some cases, to horizontally partition the same database across multiple physical nodes (“sharding”).
Extremely scalable. These repositories are built for horizontal scaling so you can add as many nodes as you need. See the previous two points.
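
To make the schema-less point concrete, here is a minimal sketch in Python that uses a plain in-memory list of dictionaries as a stand-in for a document collection. It is not the API of MongoDB, CouchDB, or any other product; the names `articles` and `find` and the sample fields are assumptions made up for illustration.

```python
# In-memory stand-in for a document collection (hypothetical; not a real driver API).
articles = [
    {"_id": 1, "type": "article", "title": "Alfresco and NOSQL"},
    {"_id": 2, "type": "article", "title": "Scaling Lucene"},
]

# Adding an "author" field to one document is just setting a property value.
articles[0]["author"] = "jpotts"


def find(collection, **criteria):
    """Return documents whose fields match every key/value in criteria."""
    return [
        doc for doc in collection
        if all(doc.get(key) == value for key, value in criteria.items())
    ]


# Documents without the field are simply not matched; nothing has to be migrated.
by_author = find(articles, author="jpotts")
all_articles = find(articles, type="article")
print(len(by_author), len(all_articles))
```

The second article never gains an "author" field, and the store has no notion of what an "article" is beyond the `type` value the application chose to write.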

NOSQL repositories are used in some extremely large implementations (Digg, Facebook, Twitter, Reddit, Shutterfly, Etsy, Foursquare, etc.) for a variety of purposes. But it’s important to note that you don’t have to be a Facebook or a Twitter to realize benefits from this type of back-end. And, although the examples I’ve listed are all consumer-facing, huge-volume web sites, traditional companies are already using these technologies in-house. I should also note that for some of these projects, scaling down is just as important as scaling up–the CouchDB founders talk about running Couch repositories in browsers, cell phones, or other devices.

If you don’t believe this has application inside the firewall, go back in time to the explosive growth of Lotus Notes and Lotus Domino. The Lotus Notes NSF store has similar characteristics to document-centric NOSQL repositories. In fact, Damien Katz, the founder of CouchDB, used to work for Iris Associates, the creators of Lotus Notes. One of the reasons Notes took off was that business users could create form-based applications without involving IT or DBAs. Notes servers could also replicate with each other, which made data highly available, even on networks with high latency and/or low bandwidth between server nodes.

Alfresco & NOSQL

Unlike a full ECM platform like Alfresco, NOSQL repositories are just that–repositories. Like a relational database, there are client tools, APIs, and drivers to manage the data in a NOSQL repository and perform administrative tasks, but it’s up to you to build the business application around it. Setting up a standalone NOSQL repository for a business user and telling them to start managing their content would be like sticking them in front of MySQL and doing the same. But business apps with NOSQL back-ends are being built. For ECM, projects are already underway that integrate existing platforms with these repositories (see the DrupalCon presentation, “MongoDB – Humongous Drupal”, for one example) and entirely new CMS apps have been built specifically to take advantage of NOSQL repositories.

What about Alfresco? People are using Alfresco and NOSQL repositories together already. Peter Monks, together with others, has created a couple of open source projects that extend Alfresco WCM’s deployment mechanism to use CouchDB and MongoDB as endpoints (here and here).

I recently finished up a project for a Metaversant client in which we used Alfresco DM to create, tag, secure, and route content for approval. Once approved, some custom Java actions deploy metadata to MongoDB and files to buckets on Amazon S3. The front-end presentation tier then queries MongoDB for content chunks and metadata and serves up files directly from Amazon S3 or Amazon’s CloudFront CDN as necessary.

In these examples, Alfresco is essentially being used as a front-end to the NOSQL repository. This gives you the scalability and replication features on the Content Delivery tier with workflow, check-in/check-out, an explicit content model, tagging, versioning, and other typical content management features on the Content Management tier.
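
The split described above can be sketched roughly as follows. This is only an illustration in Python, with in-memory dictionaries standing in for MongoDB and Amazon S3; the project itself used custom Java actions in Alfresco, and every name here (`publish`, `fetch_for_page`, the field names, the key layout) is hypothetical.

```python
# Hypothetical stand-ins: a dict for the document store, a dict for the file bucket.
metadata_store = {}   # plays the role of a MongoDB collection
file_bucket = {}      # plays the role of an S3 bucket


def publish(node):
    """Deploy an approved content item: metadata to one store, binary to the other."""
    if node["status"] != "approved":
        return False  # only approved content leaves the management tier
    key = f"content/{node['id']}/{node['filename']}"
    file_bucket[key] = node["data"]
    metadata_store[node["id"]] = {
        "title": node["title"],
        "tags": list(node["tags"]),
        "file_key": key,  # the front end uses this to locate the file
    }
    return True


def fetch_for_page(tag):
    """What a presentation tier might do: query metadata, then resolve file keys."""
    return [
        (meta["title"], meta["file_key"])
        for meta in metadata_store.values()
        if tag in meta["tags"]
    ]


draft = {"id": "n1", "status": "draft", "title": "Draft", "tags": ["news"],
         "filename": "draft.pdf", "data": b"..."}
approved = {"id": "n2", "status": "approved", "title": "Launch", "tags": ["news"],
            "filename": "launch.pdf", "data": b"%PDF"}

results = [publish(draft), publish(approved)]
page = fetch_for_page("news")
```

The point of the sketch is the division of labor: approval state lives only on the management side, while the delivery side sees nothing but published metadata and file keys.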

But why shouldn’t the Content Management tier benefit from the scalability and replication capabilities of a NOSQL repository? And why can’t a NOSQL repository have an end-user focused user interface with integrated workflow, a form service, and other traditional DM/CMS/WCM functionality? It should, it can and they will. NOSQL-native CMS apps will be developed (some already exist). And existing CMSs will evolve to take advantage of NOSQL back-ends in some form or fashion, similar to the Drupal-on-Mongo example cited earlier.

What does this mean for Alfresco and ECM architecture in general?

Where does that leave Alfresco? It seems their positioning as a developer-focused, “Internet-scale” repository ultimately leads to them competing directly against NOSQL repositories for certain types of applications. The challenge for Alfresco and other ECM players is whether or not they can achieve the kind of scale and replication capabilities NOSQL repositories offer today before NOSQL can catch up with a new breed of Content Management solutions built expressly for a world in which content is everywhere, user and data volumes are huge and unpredictable, and servers come and go automatically as needed to keep up with demand.

If Alfresco and the overwhelming majority of the rest of today’s CMS vendors are able to meet that challenge with their current relational-backed stores, NOSQL simply becomes an implementation choice for CMS vendors. If, however, it turns out that being backed by a NOSQL repository is a requirement for a modern, Internet-scale CMS, we may see a whole new line-up of players in the CMS space before long.

What do you think? Does the fundamental architecture prevalent in today’s CMS offerings have what it takes to manage the web content in an increasingly cloud-based world? Will we see an explosion of NOSQL-native CMS applications and, if so, will those displace today’s relational vendors or will the two live side-by-side, potentially with buyers not even knowing or caring what choice the vendor has made with regard to how the underlying data is persisted?

16 comments

July 7, 2010 at 7:27 am

Snig Bhaumik says:

Great article, Jeff.
Looks like NOSQL is the future.

However, does the methodology restrict itself only to CMS solutions? It is probably not suitable for RDBMS-based applications.
July 7, 2010 at 9:51 am

Shane K Johnson says:

I think you’ll see two paths.

On one hand you’ll see new CMS vendors using various NOSQL solutions. On the other, you’ll see current vendors developing new persistence managers for Jackrabbit.

This is where Alfresco really missed the boat. Those vendors using Jackrabbit are already positioned to take advantage of these persistence managers. For example, there was some work done to create an HBase persistence manager.

In addition you have products like ModeShape that offer JCR access to a variety of stores. The best example is backing it with Infinispan. You now have a completely scalable content store.

That being said I doubt you’ll see NOSQL only. Some vendors may choose to still put certain metadata in a replicated database.

Regarding Lucene, Solr or ElasticSearch are great options.

Regarding scalability, Alfresco pretty much needs to be completely rewritten from the ground up. There are bottlenecks all over the place.
July 8, 2010 at 12:47 am

Peter Monks says:

@Jeff: I’ve reached much the same conclusion as you – nosql is not a competitor to CMSes; rather it’s more likely to be an enabler of the next generation of CMSes.

<shameless plug>That said it’s not all peaches and cream – I’m currently in the middle of a more in-depth discussion of this topic. Lots yet to be said!</>

@Shane: I think your confidence in JCR may be somewhat misplaced – that ship sailed a long time ago and didn’t get very far. CMIS is already more widely adopted than JCR ever was, and (to verge on speculation for a moment) if the rumours I’ve heard are true, even the One True Source of JCR (Day Software) are increasingly distancing themselves from it, which could spell the final death knell for JCR.
July 9, 2010 at 11:44 am

Matt Hamilton says:

I’ll be interested to see where this all goes. We predominantly use Plone for any of our CMS projects. Plone, being based on Zope and the ZODB (http://www.zodb.org/) has been doing NOSQL for over a decade.

Technically it is fantastic, and I really haven’t understood why all other systems up until now have stuck to using relational databases for CMSes (which are generally not very relational)…. however… it has always been a hard ‘sell’ to clients.

client: “What database does it use?”
us: “The ZODB, a high performance transactional object store”
client: “OK, we’ll get our DBA’s to take a look at it”
dba: “Uhhh… why is it not Oracle?”

In most people’s minds Database == RDBMS. They just don’t seem to be able to think beyond that. If it doesn’t use SQL then they just don’t get it.

What is interesting is that the Plone community has not been all that excited about CMIS, partly due to a lack of genuine interest (we’ve already got so many standard protocols to exchange data, why bother with another; yes, that is a simplistic view, but valid in 95% of cases) but also because CMIS itself (or at least the query language) assumes in its design that the underlying storage is an RDBMS. The CMIS query language is basically SQL-like. This actually makes it harder to implement on top of non-SQL-based storages… such as the ZODB. You’d have to map the ideas of CMIS onto the object-oriented nature of the NOSQL database.

So I’ll be very interested to see how the likes of Alfresco mate the ideas of CMIS with those of NOSQL and how deeply they decide to integrate the two.

-Matt
July 9, 2010 at 12:46 pm

Shane K Johnson says:

@Peter: As numerous other blogs have stated, CMIS and JCR are not competitors. That’s like saying CMIS is better than Java. They are not even related. That, and CMIS works on top of JCR. CMIS is a high-level REST API. JCR is a low-level Java API. CMIS is useful to expose your repository’s content/actions to your application clients, but it is not what you use to actually build your repository application. Now more vendors, JCR or not, may be adding CMIS but we have yet to see any real, practical benefits. While integrating a portal with Alfresco, we ended up writing our own web scripts because CMIS either performed poorly or it did not offer what we needed. We tried to use CMIS only, but it just didn’t work out.

Now it doesn’t have to be JCR per se, but a universal, developer-centric API that allows you to abstract the physical persistence is a great thing. There are also a few libraries being worked on that allow you to abstract your NOSQL implementation. That is pretty neat too.

If you remove JCR, what do you have? SQL? That is simply not going to scale. You could go with something proprietary, but what do you do when you want to change your persistence? Like right now, how much work do you think it would take Alfresco to rewrite their persistence to, say, a distributed file system or data grid? Well, if they used Jackrabbit all they would have to do is substitute the persistence manager for one that supports a distributed file system such as Hadoop. If they used ModeShape they would just have to switch the connector out for the Infinispan connector.

Also, every CMS is hierarchical so you need something “like” JCR rather than SQL or a simple key/value API that has no notion of a hierarchy. This is why Jackrabbit is nice. It can persist to a database or a key/value store but still provide the developer with a hierarchical view.

Rather than say confidence, I’d say preference. Though considering the number of JCR based repositories out there (and those currently migrating to it), I’d say it has been successful. I have been able to work on several different JCR based repositories equally as well since they all used the same API. I didn’t have to learn a new proprietary API, and that’s what I prefer 😉

Let’s put it this way. Why don’t you want to use JCR?
July 9, 2010 at 4:55 pm

Peter Monks says:

@Shane, there’s a lot I don’t like about JCR, but in the interests of brevity I’ll mention just 2 of the more egregious problems I see with it:
1. the “J”
2. the imposition of a hierarchy into every single content model

#1 has been commented on ad nauseam since JCR first saw the light of day. One thing to keep in mind is that I consider myself a Java guy and I consider it a huge mistake – I can’t even begin to imagine how ridiculous this looks to non-Java CMS folks (who comprise the majority of the CMS community!).

Regarding #2, you’re mistaken that “every CMS is hierarchical” – to pick one example off the top of my head: OTEX VCM is not hierarchical at its core (though it layers in some hierarchies at higher layers for implementer convenience).

The real issue is that there are perfectly valid content models that don’t map to the simplistic “one size fits all” hierarchy mandated by JCR. They might be “flat” (document db-like) or perhaps require more sophisticated data structures (multi-hierarchies, digraphs, …) in order to support some complex navigational scheme on the end site.

Having to shoehorn such a model into a simplistic pre-baked hierarchy is an unnecessary hassle for the implementer. It would be much better if the content modeling facilities provided by the CMS supported hierarchies (and multi-hierarchies, and digraphs, and free-form graphs, and …), perhaps even shipped with samples showing how they are done, but did NOT mandate them.

I should mention that item #2 applies to CMIS almost as much as JCR, although at least CMIS has a reasonable notion of “Unfiled Content”. Now I realise that JCR “supports” that too… …but it does so by filing “unfiled” content in a special system hierarchy! If that isn’t sidestepping the issue I don’t know what is! 😉
July 10, 2010 at 4:59 pm

Juerg Meier says:

@jeff: ever since I started working in the ECM camp on platforms from Open Text, IBM, Documentum and notably Alfresco (“only DMS kernel developed in this century”), I have been convinced that all of them are too strongly integrated. They come all-in-one: persistence for both meta and binary data, workflow engine, search engine, security, GUI technology, etc… Replacing one of these elements is typically a nightmare.

I therefore expect that we will see a new generation of ECM architectures with much more decoupled and highly dedicated components that communicate via standards.

As you write it nicely: “The challenge for Alfresco and other ECM players is whether or not they can achieve the kind of scale … for a world in which content is everywhere, user and data volumes are huge and unpredictable”. I think they could, but they must shift away from silo- and “we can do everything”-thinking.

-Juerg
July 14, 2010 at 8:58 am

Nate says:

New technology is appealing to the developer, but to the company there has to be a more practical and compelling reason than simply having a better mousetrap. The cost of getting in-house developers skilled enough to support a new environment is a big reason not to move to some of these new technologies. And if you find that you have to replace a skillset, then good luck finding someone with professional experience.
Many nice technologies come and go because the corporation has to decide whether to implement new tech for a long-term strategic advantage or stay where they are, which has been working fine anyway.
I suspect that this technology will not become common for most companies. They will stay with SQL because they can find experienced SQL developers anywhere. Google, Digg and Facebook are pioneering companies that can afford to implement new tech given that they are all social media. Not too much risk here in choosing new tech. I doubt that these companies use this tech for their financial or HR databases. Time will tell but I see this tech remaining in the realm of technology enthusiasts.
July 14, 2010 at 10:18 am

jpotts says:

For any given new technology, there are traditional companies who, for whatever reason, will find that the risk is worth the potential benefit. These early adopters learn lessons that eventually benefit the followers, lowering the risk and attracting broader adoption and so on. This is the classic adoption curve that led to SQL being used in the workplace originally (and every other piece of technology that has come before).

What I think you’re saying is that you don’t see anything compelling enough about this particular database model for it to become broadly adopted in non-web-centric companies. It seems to me that there are plenty of applications in the non-web world that NOSQL would be better suited for than relational databases. I’m not saying relational databases will be replaced completely by NOSQL alternatives. I’m saying that not having a “relational default” is a good thing and that developers are going to use the tool that is easiest to work with and is the best fit for the requirements at hand.

Bringing it back to ECM, I think there is a good chance that vendors will take advantage of this technology because it may allow them to achieve better scale, replication, and model flexibility than they can currently achieve with a relational back-end.

From an ECM customer’s perspective, how much longer will customers actually care what’s going on in the back-end as long as the product delivers on all of the “-ilities” needed for their solution? A customer cares about the back-end to the extent that (1) they have to pay licensing and support costs for the server and (2) they have someone that knows how to take care of it, as you mentioned. As costs continue to trend toward zero the first item becomes less of a concern in the long-term. Relational? Graph? Document-oriented? Who cares. Just pay your “data persistence” bill when Amazon or Google sends it to you.

I agree that, depending on what you’re doing, someone in your organization has to learn the technology to some extent in order to utilize it effectively. But I wonder if, in the ECM case, vendors will take on some of the burden. This already happens today with products like Alfresco where the DBA is rarely involved. Yes, they take care of the database server care and feeding, and they may do some performance tuning, but the DBA is completely uninvolved in the database details. Alfresco and iBATIS/Hibernate take care of that. Or look at Day Software. Their products don’t require a relational database at all. CRX and CQ install with the Tar Persistence Manager by default. I doubt that Day’s customers see Day’s choice of persistence as exposing them to any more risk than is already present in a CMS purchase decision.

I’ll go back to the Lotus Notes example. Notes didn’t use a relational back-end. It used a document-oriented store called NSF. When a company rolled out Notes, they didn’t have to hire NSF experts. Companies with large rollouts did hire Notes Administrators, but those folks were primarily server people, not database people. They watched the server logs, made sure replication was happening when it was supposed to, kept the software up-to-date, managed users, groups, and permissions, and maybe did some troubleshooting or support for the apps that ran on the server. Extremely small businesses or big companies with departments running servers under their desks weren’t likely to hire dedicated administrators at all. I think we’re headed in this direction with ECM, eventually even for extremely large, clustered implementations.

So, I disagree that this is technology for “enthusiasts” only. I think we will see enterprise adoption. The question is how long will it take and what kinds of cool apps will we see to take advantage of it.

Jeff
July 15, 2010 at 9:44 am

Nate says:

You may be correct. I have noticed in news articles that some companies do try to stay current and innovative.
I forget that there are trend setting companies out there.
July 15, 2010 at 2:17 pm

Matt Hamilton says:

What I found useful when dealing with systems which use a non-relational backend is to just never utter the word ‘database’. Object store, persistence engine, document repository, all fine. Just never say ‘database’, as most people still can’t distinguish that word from RDBMS.

-Matt
December 2, 2010 at 8:12 am

ukdavo says:

An old thread but interesting. Have you tried Lily yet? http://www.lilyproject.org/lily/index.html.
December 2, 2010 at 9:16 am

jpotts says:

I took a look at Lily back in July when it was first available but it was still very rough. Probably time to circle back and do a proper eval. Have you spent any time with it?