Category: NoSQL

A NoSQL repository uses something other than a relational model for persistence. NoSQL repositories tend to be highly scalable and come in a variety of flavors, from key-value stores to graph stores to document stores.

Alfresco Anti-Patterns: When You Probably Shouldn’t Use Alfresco

There are plenty of write-ups listing what Alfresco can do–I thought it might be instructive to list the things people often try to use Alfresco for but shouldn’t. I’ve got five examples in my list. The first two are common mistakes people make during product selection. The last three are more architectural.

Anti-Pattern #1: Dynamic Web Content Management (like Drupal or WordPress)

I think this is happening less, but every once in a while I’ll still see people trying to compare Alfresco to dynamic WCM platforms like Drupal or WordPress. Alfresco has very little in common with systems like these. If you install Alfresco and expect it to serve up a pretty web site out-of-the-box with downloadable themes and tons of modules or widgets you can use to add features to your web site, you’ll be disappointed. This isn’t a shortcoming of the tool; it’s just not what it was built for.

There are plenty of people who use Alfresco to manage assets that are eventually served up to the web. They’ll use Alfresco Share or a custom UI as the “administrative” interface for managing content. Then, they’ll push that content out to some other system on the presentation tier (Saks Fifth Avenue and New York Philharmonic are two examples).

There are partners who have created WCM solutions on top of Alfresco (see Crafter). Solutions like that leverage the power of Alfresco as a content repository and then add in the missing pieces, which are mostly about presentation layer, site building, and content creation.

The bottom line is that if you find yourself comparing out-of-the-box Alfresco to systems like Drupal or WordPress, you have made a mistake in your evaluation.

Anti-Pattern #2: Full-featured wiki, portal, blog, forums, or calendar

I’ve encountered several people looking to replace major collaboration systems in their IT footprint with Alfresco. Maybe they’ve decided to use Alfresco for document management, but they want to see what else they might be able to replace. They have a wiki they want to replace, they see Alfresco has a wiki. Problem solved, right? This is where box-checking against a feature list gets you into trouble.

Alfresco is a document management repository with a powerful embedded workflow engine. Alfresco Share, the web client that sits on top of Alfresco, is great for basic document management, processes around documents, and team collaboration.

For teams and projects, Alfresco Share uses a “site” metaphor to keep everything related to that team or project together. Each site has a dashboard. Out-of-the-box “dashlets” can be used to summarize or highlight information stored in the site. Out-of-the-box, everyone sees the same dashboard for a site, which is configured by a site manager. There is no easy way for a power user to specify through the UI which dashlets should be restricted to which users or groups of users, like there would be in a portal, for example. So, although dashlets look like “portlets”, Alfresco Share doesn’t really have much else in common with portals. If what you really want is a full-blown portal server, you should look at something like Liferay or Exo.

Each site can also be configured with a number of collaborative tools such as discussions, blog, wiki, and calendar. These are more than adequate to facilitate most of what a team, project, or department needs. But none of them individually are going to replace full-featured, standalone systems. If you need the power of a full wiki, install MediaWiki. If you need a blog server, install WordPress. And so on.

Those are two where I see people making adjustments in their expectations early in the product evaluation phase. Now let’s look at a few that may not get uncovered until an architect or developer gets involved…

Anti-Pattern #3: Highly relational solutions

Alfresco relies on three main pillars to deliver its functionality: The file system, a search engine (Lucene or Solr), and a relational database. But you won’t be touching any of those directly. Instead, you’ll work with an abstraction which is simply, “the repository”.

Don’t be misled by the inclusion of a relational database as one of its dependencies. It is there to manage metadata. As you start to customize Alfresco to meet your specific requirements, you’ll define the content model. Alfresco will do the work of reading your content model and storing metadata for instances of those content types in the database.

Objects in the repository can be related to each other through “associations”. These are essentially pointers between one or more objects. There are a couple of challenges with these. First, they cannot easily be queried. You can ask an object for its associations and then you can iterate over those, but you cannot do a traditional “join” across objects.

For example, suppose you have a “whitepaper” object that has an association to one or more “product” objects. You cannot execute a single query that says “Give me all whitepapers containing the word ‘performance’ that are associated with the product named ‘Acme Widget’”.

One way people work around this is to de-normalize their data, then implement code that keeps it in sync. In this example, you could add a multi-value property on the whitepaper object that would store the names of the products a whitepaper is related to. Then you’d be able to run that example query.

If the name stored on the product object changes, your code would trigger an update on all corresponding whitepapers to keep the product name in sync. If you have a small number of such relationships with a reasonable number of objects on either side of the relationship this is fine, but you can see how it might quickly get out-of-hand.
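To make the trade-off concrete, here is a minimal sketch of the de-normalization workaround in plain Python. It does not use any Alfresco API; `Repo`, `Product`, `Whitepaper`, and every method name are hypothetical stand-ins for a repository, and the "query" is just a loop over in-memory objects. The point is only to show where the sync cost comes from.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    product_id: str
    name: str


@dataclass
class Whitepaper:
    title: str
    text: str
    product_ids: list  # the "association": pointers to product objects
    product_names: list = field(default_factory=list)  # de-normalized copy


class Repo:
    """Hypothetical in-memory stand-in for a content repository."""

    def __init__(self):
        self.products = {}
        self.whitepapers = []

    def add_product(self, product):
        self.products[product.product_id] = product

    def add_whitepaper(self, paper):
        # Copy the current product names onto the whitepaper when it is saved.
        paper.product_names = [self.products[pid].name for pid in paper.product_ids]
        self.whitepapers.append(paper)

    def rename_product(self, product_id, new_name):
        # The sync step: every related whitepaper has to be rewritten.
        self.products[product_id].name = new_name
        updated = 0
        for paper in self.whitepapers:
            if product_id in paper.product_ids:
                paper.product_names = [self.products[pid].name for pid in paper.product_ids]
                updated += 1
        return updated

    def search(self, word, product_name):
        # No join needed: the product name lives on the whitepaper itself.
        return [
            paper for paper in self.whitepapers
            if word.lower() in paper.text.lower() and product_name in paper.product_names
        ]
```

Note that `rename_product` returns the number of whitepapers it had to touch. With a handful of whitepapers per product that number stays small; with thousands, a single rename turns into thousands of writes.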

So if your underlying data is highly-relational, don’t try to force it into an Alfresco content model. Instead, move the relational data to a database and use Alfresco only for the content pieces.

Anti-Pattern #4: JSON/XML object store

It’s really common to store chunks of JSON or XML as content in Alfresco. For example, maybe you have some data that isn’t expressed well as name-value pairs. Or maybe the content you need to manage just happens to be in one of those formats. But if that’s all you need to persist in the repository you really ought to be asking yourself why you are using Alfresco when there are many lighter-weight, more scalable technologies that are purpose-built for this.

One limitation of storing JSON or XML as content in Alfresco is that the repository has no semantic understanding of the content. For example, suppose you have a book object that is represented by JSON and you store that JSON as content. It’s likely that the JSON would contain properties like “title”, “author”, or “ISBN”. Out-of-the-box, none of those will be queryable by property. Alfresco will simply attempt to full-text index the content like any other content stream. It doesn’t understand the difference between “title” and “author” because that meaning is embedded in the content itself, not the object. The same is true for XML.

You can work around this by setting up metadata extractors to grab data out of the JSON or XML and store it in properties on the object. Then, you can query the object’s properties through Alfresco. But if all of your objects are similarly-structured it might make more sense to use a document-oriented NoSQL repository or an XML database instead. When you store a JSON document in something like Elasticsearch, Couch, or MongoDB, no extra work is necessary because those systems natively understand JSON.
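As a rough sketch of what the extraction workaround amounts to, the Python below pulls a few keys out of a JSON content stream and stores them as properties that can then be matched directly. This is not Alfresco's metadata extractor API; the `book:`-prefixed property names, `PROPERTY_MAP`, and both functions are assumptions made up for the illustration.

```python
import json

# Hypothetical mapping from JSON keys to repository property names.
PROPERTY_MAP = {"title": "book:title", "author": "book:author", "ISBN": "book:isbn"}


def extract_properties(content: str) -> dict:
    """Parse a JSON content stream and return only the mapped properties."""
    data = json.loads(content)
    return {prop: data[key] for key, prop in PROPERTY_MAP.items() if key in data}


def find_by_property(objects, prop, value):
    """Match on extracted properties instead of full-text searching the content."""
    return [obj for obj in objects if obj["properties"].get(prop) == value]


# Placeholder data; the object keeps both the raw content and the extracted properties.
book_json = '{"title": "Example Book", "author": "A. Author", "ISBN": "000-0000000000"}'
stored = {"content": book_json, "properties": extract_properties(book_json)}
```

Every new key you care about means another entry in the mapping and a re-extraction of existing objects, which is the extra work a JSON-native store avoids.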

Anti-Pattern #5: Storing lots of content-less objects

A content-less object is an object that lacks a content stream. It’s common to have one or two types of content-less objects in your Alfresco-based solution because there are usually good reasons to have objects that don’t have a file associated with them. Maybe you are storing some configuration as properties on an object, for example. But if you need to store nothing but content-less objects, you are throwing away many of the benefits, like full-text search, transformations, and file-based protocols, that you get from a repository like Alfresco that is built specifically for managing file-based content.

If you just need to store objects that have properties but no file-based content, you might be better off with a document-oriented NoSQL repository or a key-value store.

Summary

As I mentioned at the start of the post, there are a lot of cases where Alfresco makes sense and you can find many of these around the net. The goal of this post was to list common misconceptions or even misuses of Alfresco that can cost you time and money.

Any time you invest in a platform you’ll find corner cases that the platform wasn’t meant to address and you can often work around those with code. What you don’t want to do, though, is have your entire system be a corner case relative to the platform’s sweet spot. That’s no fun for anybody.

Alfresco, NOSQL, and the Future of ECM

Alfresco wants to be a best-in-class repository for you to build your content-centric applications on top of. Interest in NOSQL repositories seems to be growing, with many large well-known sites choosing non-relational back-ends. Are Alfresco (and, more generally, nearly all ECM and WCM vendors) on a collision course with NOSQL?

First, let’s look at what Alfresco’s been up to lately. Over the last year or so, Alfresco has been shifting to a “we’re for developers” strategy in several ways:

  • Repositioning their Web Content Management offering not as a non-technical end-user tool, but as a tool for web application developers
  • Backing off of their mission to squash Microsoft SharePoint, positioning Alfresco Share instead as “good enough” collaboration. (Remember John Newton’s slide showing Microsoft as the Death Star and Alfresco as the Millennium Falcon? I think Han Solo has decided to take the fight elsewhere.)
  • Making Web Scripts, Surf, and Web Studio part of the Spring Framework.
  • Investing heavily in the Content Management Interoperability Services (CMIS) standard. The investment is far-reaching–Alfresco is an active participant in the OASIS specification itself, has historically been first-to-market with their CMIS implementation, and has multiple participants in CMIS-related open source projects such as Apache Chemistry.

They’ve also been making changes to the core product to make it more scalable (“Internet-scalable” is the stated goal). At a high level, they are disaggregating major Alfresco sub-systems so they can be scaled independently and in some cases removing bottlenecks present in the core infrastructure. Here are a few examples. Some of these are in progress and others are still on the roadmap:

  • Migrating away from Hibernate, which Alfresco Engineers say is currently a limiting factor
  • Switching from “Lucene for everything” to “Lucene for full-text and SQL for metadata search”
  • Making Lucene a separate search server process (presumably clusterable)
  • Making OpenOffice, which is used for document transformations, clusterable
  • Hiring Tom Baeyens (JBoss jBPM founder) and starting the Activiti BPMN project (one of their goals is “cloud scalability from the ground, up”)

So for Alfresco it is all about being an internet-scalable repository that is standards-compliant and has a rich toolset that makes it easy for you to use Alfresco as the back-end of your content-centric applications. Hold that thought for a few minutes while we turn our attention to NOSQL for a moment. Then, like a great rug, I’ll tie the whole room together.

NOSQL Stores

A NOSQL (“Not Only SQL”) store is a repository that does not use a relational database for persistence. There are many different flavors (document-oriented, key-value, tabular), and a number of different implementations. I’ll refer mostly to MongoDB and CouchDB in this post, which are two examples of document-oriented stores. In general, NOSQL stores are:

  • Schema-less. Need to add an “author” field to your “article”? Just add it–it’s as easy as setting a property value. The repository doesn’t care that the other articles in your repository don’t have an author field. The repository doesn’t know what an “article” is, for that matter.
  • Eventually consistent instead of guaranteed consistent. At some point, all replicas in a given cluster will be fully up-to-date. If a replica can’t get up-to-date, it will remove itself from the cluster.
  • Easily replicate-able. It’s very easy to instantiate new server nodes and replicate data between them and, in some cases, to horizontally partition the same database across multiple physical nodes (“sharding”).
  • Extremely scalable. These repositories are built for horizontal scaling so you can add as many nodes as you need. See the previous two points.
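The first two characteristics are easier to see in a toy model. The Python sketch below is not how MongoDB or CouchDB are implemented; `Replica`, `Cluster`, and the explicit `replicate()` step are assumptions chosen to keep the example small. Documents are plain dicts with no declared schema, and a write is visible on one replica before it reaches the others.

```python
class Replica:
    """One node holding schema-less documents (plain dicts)."""

    def __init__(self):
        self.docs = {}
        self.inbox = []  # replicated writes received but not yet applied

    def apply_pending(self):
        for doc_id, doc in self.inbox:
            self.docs[doc_id] = doc
        self.inbox.clear()


class Cluster:
    """Toy cluster: writes land on replica 0 and are queued for the rest."""

    def __init__(self, size):
        self.replicas = [Replica() for _ in range(size)]

    def write(self, doc_id, doc):
        # Visible immediately on the first replica...
        self.replicas[0].docs[doc_id] = dict(doc)
        # ...but only queued for the others until replication runs.
        for replica in self.replicas[1:]:
            replica.inbox.append((doc_id, dict(doc)))

    def read(self, replica_index, doc_id):
        return self.replicas[replica_index].docs.get(doc_id)

    def replicate(self):
        for replica in self.replicas:
            replica.apply_pending()
```

Writing `{"title": "Hello"}` and then `{"title": "World", "author": "someone"}` needs no migration for the new field, and a read from replica 1 returns `None` for both until `replicate()` has run.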

NOSQL repositories are used in some extremely large implementations (Digg, Facebook, Twitter, Reddit, Shutterfly, Etsy, Foursquare, etc.) for a variety of purposes. But it’s important to note that you don’t have to be a Facebook or a Twitter to realize benefits from this type of back-end. And, although the examples I’ve listed are all consumer-facing, huge-volume web sites, traditional companies are already using these technologies in-house. I should also note that for some of these projects, scaling down is just as important as scaling up–the CouchDB founders talk about running Couch repositories in browsers, cell phones, or other devices.

If you don’t believe this has application inside the firewall, go back in time to the explosive growth of Lotus Notes and Lotus Domino. The Lotus Notes NSF store has similar characteristics to document-centric NOSQL repositories. In fact, Damien Katz, the founder of CouchDB, used to work for Iris Associates, the creators of Lotus Notes. One of the reasons Notes took off was that business users could create form-based applications without involving IT or DBAs. Notes servers could also replicate with each other which made data highly-available, even on networks with high latency and/or low bandwidth between server nodes.

Alfresco & NOSQL

Unlike a full ECM platform like Alfresco, NOSQL repositories are just that–repositories. Like a relational database, there are client tools, API’s, and drivers to manage the data in a NOSQL repository and perform administrative tasks, but it’s up to you to build the business application around it. Setting up a standalone NOSQL repository for a business user and telling them to start managing their content would be like sticking them in front of MySQL and doing the same. But business apps with NOSQL back-ends are being built. For ECM, projects are already underway that integrate existing platforms with these repositories (see the DrupalCon presentation, “MongoDB – Humongous Drupal”, for one example) and entirely new CMS apps have been built specifically to take advantage of NOSQL repositories.

What about Alfresco? People are using Alfresco and NOSQL repositories together already. Peter Monks, together with others, has created a couple of open source projects that extend Alfresco WCM’s deployment mechanism to use CouchDB and MongoDB as endpoints (here and here).

I recently finished up a project for a Metaversant client in which we used Alfresco DM to create, tag, secure, and route content for approval. Once approved, some custom Java actions deploy metadata to MongoDB and files to buckets on Amazon S3. The front-end presentation tier then queries MongoDB for content chunks and metadata and serves up files directly from Amazon S3 or Amazon’s CloudFront CDN as necessary.

In these examples, Alfresco is essentially being used as a front-end to the NOSQL repository. This gives you the scalability and replication features on the Content Delivery tier with workflow, check-in/check-out, an explicit content model, tagging, versioning, and other typical content management features on the Content Management tier.

But why shouldn’t the Content Management tier benefit from the scalability and replication capabilities of a NOSQL repository? And why can’t a NOSQL repository have an end-user focused user interface with integrated workflow, a form service, and other traditional DM/CMS/WCM functionality? It should, it can and they will. NOSQL-native CMS apps will be developed (some already exist). And existing CMS’s will evolve to take advantage of NOSQL back-ends in some form or fashion, similar to the Drupal-on-Mongo example cited earlier.

What does this mean for Alfresco and ECM architecture in general?

Where does that leave Alfresco? It seems their positioning as a developer-focused, “Internet-scale” repository ultimately leads to them competing directly against NOSQL repositories for certain types of applications. The challenge for Alfresco and other ECM players is whether or not they can achieve the kind of scale and replication capabilities NOSQL repositories offer today before NOSQL can catch up with a new breed of Content Management solutions built expressly for a world in which content is everywhere, user and data volumes are huge and unpredictable, and servers come and go automatically as needed to keep up with demand.

If Alfresco and the overwhelming majority of the rest of today’s CMS vendors are able to meet that challenge with their current relational-backed stores, NOSQL simply becomes an implementation choice for CMS vendors. If, however, it turns out that being backed by a NOSQL repository is a requirement for a modern, Internet-scale CMS, we may see a whole new line-up of players in the CMS space before long.

What do you think? Does the fundamental architecture prevalent in today’s CMS offerings have what it takes to manage the web content in an increasingly cloud-based world? Will we see an explosion of NOSQL-native CMS applications and, if so, will those displace today’s relational vendors or will the two live side-by-side, potentially with buyers not even knowing or caring what choice the vendor has made with regard to how the underlying data is persisted?

Notes from OSCON 2009 in San Jose

I’m back from San Jose. My colleague, Dave Gynn, and I had fun at the O’Reilly Open Source Conference (OSCON) and learned a lot. Dave’s ability to pick out open source rockstars from a crowd is uncanny. It was pretty sweet seeing Larry Wall (and his family) hanging out and then hearing him speak. Although there are all kinds of topics on all things Open Source, the conference does have a heavy Perl bias.

Dave and I decided we were glad we went but we don’t feel like we have to be there every year going forward. This was my first time, but Dave said the general excitement level seemed low for some reason. Maybe it was Allison Randal’s seriously downbeat welcome address. Not sure. Anyway, here are my rough notes from some of the sessions I attended…

“Open Source in Government” was a big theme at OSCON this year. Speakers tried to instill a sense of urgency in the audience by saying that the window of opportunity for getting the government behind open source in a big way will only be open for a few more months. If you want to get involved, check out some of these links:

Data.gov mash-up contest
http://sunlightlabs.com/contests/appsforamerica2/

Machine readable datasets from the US Govt
http://www.data.gov/

Help the government make better use of open source
http://www.opensourceforamerica.org/

Some folks from Liferay presented on a new UI framework they’ve created called Alloy. Alloy is aimed at providing a single framework that addresses HTML, CSS, and JavaScript in a way that is abstracted from the underlying libraries. Alloy basically extends/subclasses jQuery and YUI. Liferay is now migrating a lot of their OOTB portlets to the new framework. It is expected to ship as part of 5.3. This talk was more about the “why” and less about the “what”. I would have liked to see more examples/demos.

Went to a talk on “using Django for election audits” that turned out to be more about how screwed up our elections process is and the minutiae of performing an audit on election results with not so much on how Django was used to solve the problem. The speaker did give a shout out to the Django Debug Toolbar that might prove to be useful. The presenter is looking for help with the project. He needs everything from UI help to people who can send him election results from their local election boards.

Saw a decent talk on Apache CouchDB. Couch is a schema-less database that is built for massive distributed scalability. Instead of SQL you use map-reduce functions to query. Key to Couch is the concept of “eventual consistency”–in a Couch app, data can be consistent over time instead of right now. Couch always knows either the correct old value or the correct current value, but it may take time to propagate the current value to every node in the system.

Noteworthy bullet points:

  • Couch can idle in 4MB of RAM. With a couple of production databases Couch will use about 20MB.
  • Canonical is including Couch in the Karmic Koala release. This will give apps running on Karmic the ability to easily sync data between nodes. Couch will also be running as part of Ubuntu One which means Karmic desktops can sync data with the Ubuntu cloud (See the Ubuntu wiki).
  • Someone is currently working on a JavaScript implementation of Couch. Among other things, this would give you the ability to replicate your CouchDB to a local version of Couch running in someone’s browser.
  • Current ACL is limited to “you are either an admin or you aren’t”. ACL for writers *might* make it into 1.0. ACL for readers won’t.
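For anyone who hasn’t seen the map-reduce style of querying mentioned above, here is a small Python imitation of the idea. Real CouchDB views are normally written as JavaScript functions that call `emit`; the `run_view` helper, the sample documents, and the field names below are all made up for illustration and are not part of any CouchDB API.

```python
from collections import defaultdict

# Made-up sample documents; note that they don't all share the same fields.
docs = [
    {"type": "article", "author": "alice", "words": 1200},
    {"type": "article", "author": "bob", "words": 800},
    {"type": "comment", "author": "alice"},
    {"type": "article", "author": "alice", "words": 300},
]


def map_article(doc):
    """Map step: emit (key, value) pairs; non-matching documents emit nothing."""
    if doc.get("type") == "article":
        yield doc["author"], doc["words"]


def reduce_sum(values):
    """Reduce step: collapse all values emitted for one key."""
    return sum(values)


def run_view(documents, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in grouped.items()}


words_by_author = run_view(docs, map_article, reduce_sum)
```

The map function plays the role of a `WHERE` clause plus a projection, and the reduce function plays the role of an aggregate with `GROUP BY`.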

I went to the “JRuby on AppEngine” talk not for the JRuby, but because it was the only Google AppEngine session I could find. I was looking for some factoids on who’s using AppEngine. Here’s what they said:

  • 200,000 registered developers
  • 85,000 applications
  • Household names such as: eBay, Best Buy, Forbes, Whitehouse.gov.

Whitehouse.gov was a cool scalability story for AppEngine. They used AppEngine to moderate questions submitted during Obama’s first online town hall. According to the Google Code blog,

“During the 48-hour open voting period, the site peaked at 700 hits per second, and 92,934 people submitted 104,073 questions and cast 3,605,984 votes. In total, over one million unique visitors visited the site before the town hall. Even while the site was featured on major news outlets and even the Google homepage the other 50,000 apps built on App Engine were fully supported and experienced no adverse effects.”

The Erlang talk provided a good history of the language. I would have liked more on the language itself and less of the detailed history behind Ericsson’s telecom switches (even though Erlang played a critical role in those products). I was aware that CouchDB is built with Erlang but the speaker mentioned a couple of other open source projects that leverage Erlang that I hadn’t heard of: ejabberd is an Erlang-based chat server and RabbitMQ is an Erlang-based messaging server.

The “building a business on an open source distributed cloud” talk by Bradford Stephens was good. The speaker’s company, Visible Technologies, mines social networks and the internet in general for consumer sentiment on its customers’ brands. Their system ingests vast subsets of the Internet, parses the results, processes them, and indexes them so that they can run analytics against them for their clients. They moved from an all-Microsoft stack to an open source stack and have been very happy with it.

This was the third “noSQL”-themed talk I saw. He made a good point that when we design apps, we should be saying, “I need persistence” and then figure out what is the best provider of that given scalability and other constraints rather than starting out with “I need a relational database”.
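One way to read that advice in code: write the application against a small persistence interface and choose the provider later. The sketch below is my own illustration, not anything shown in the talk; `Store`, `DictStore`, `SqliteStore`, and `record_mention` are hypothetical names, and SQLite stands in for "a relational database" only because it ships with Python.

```python
import json
import sqlite3
from typing import Optional, Protocol


class Store(Protocol):
    """What the application actually needs: save and load documents by key."""

    def save(self, key: str, doc: dict) -> None: ...

    def load(self, key: str) -> Optional[dict]: ...


class DictStore:
    """In-memory provider, useful for tests or small data sets."""

    def __init__(self):
        self._docs = {}

    def save(self, key, doc):
        self._docs[key] = dict(doc)

    def load(self, key):
        return self._docs.get(key)


class SqliteStore:
    """Relational provider that keeps each document as a JSON string."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS docs (key TEXT PRIMARY KEY, body TEXT)"
        )

    def save(self, key, doc):
        self._db.execute(
            "INSERT OR REPLACE INTO docs (key, body) VALUES (?, ?)",
            (key, json.dumps(doc)),
        )
        self._db.commit()

    def load(self, key):
        row = self._db.execute(
            "SELECT body FROM docs WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None


def record_mention(store: Store, brand: str, sentiment: str) -> None:
    # Application code depends only on the Store interface, not on a provider.
    store.save(brand, {"brand": brand, "sentiment": sentiment})
```

Swapping `DictStore()` for `SqliteStore()`, or for a wrapper around a document store, changes nothing in `record_mention`, which is the point of starting from "I need persistence".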

The open source stack used by Visible Technologies includes the usual search players (Lucene, Nutch, Solr) as well as one I haven’t heard of: Katta is used to shard large Lucene indexes across multiple servers. They also use a couple of Hadoop sub-projects, HBase and ZooKeeper, and several others.

The New York Times API and NPR API talks were very good. I didn’t realize how many different API’s NYT has exposed. You can check out their API’s around people, news, search, movies, and books at http://developer.nytimes.com. Their blog is also worth checking out.

Lots of apps have been built using the NYT API. A personal favorite is InstantWatcher. It is a mash-up of NYT’s movies API with Netflix that helps you find good movies available to watch instantly.

NPR’s talk focused less on their specific API and more on how it is being used. Noteworthy bullets:

  • You can build API calls with their query generator (requires a free API key) or by hand (doc).
  • NPR offers tiered key levels. If you create something cool and drive a little traffic their way, you can get your key upgraded to a higher tier.
  • There are no rate limits. NPR believes they have built an infrastructure that can take “anything we can throw at it”.
  • The API has 2,000 users and serves 24 million requests (per ?) averaging 2 million requests per month.
  • 50% of the API requests are for NPRML with less than 0.1% requesting ATOM. NPR API results are also available as JSON, RSS, and several other formats.
  • The NPR Digital Media team blogs at http://www.npr.org/blogs/inside/
  • Interesting side-note: NPR is currently migrating off of Oracle 10g to MySQL

After the NYT and NPR talks, they held a developer meet-up of sorts. Unfortunately I had to head to the airport so I missed out on that.

Apache CouchDB looks interesting

Here’s something to add to my “dive deeper when I have the time” list: Apache CouchDB. It’s a document database accessible via REST, which by itself isn’t terribly unique. What caught my eye was that it was built from the ground up to be distributed. You can replicate documents across multiple nodes, maintain partial replicas, and sync for offline use. The roadmap has some significant features that need to be implemented before most people would use it in production, but still, it’s something to keep an eye on.