Month: July 2009

Alfresco Developer Guide source reorg and 3.2 Community update

[UPDATE: Added a link to the source code that works with 3.2 Enterprise]

I originally wrote the Alfresco Developer Guide source code for Alfresco 2.2 Enterprise and Alfresco 3 Labs. The code was pretty much the same regardless of which one you were running. For things that did happen to be different, I handled those with separate projects: one for community-specific stuff and one for enterprise-specific stuff. This was pretty much limited to minor web script differences for the “client extensions” projects and LDAP configuration differences for the “server extension” project.

With the release of 3.2 Community, I realized:

  • The number of different flavors of Alfresco any given reader might be running are going up, not down. Who knows when 2.2 Enterprise will be sunset.
  • It is no longer as easy as “Enterprise” versus “Labs/Community” because multiple releases of the same flavor are prevalent (2.2E, 3.0E, and 3.1E, for example).
  • Tagging my code in Subversion by Chapter alone is no longer enough–I need to tag by Chapter and by Alfresco version.
  • Sending the publisher the code one chapter at-a-time and expecting them to manage updates and deciding how to organize all of the chapter code was a bad idea.

So, I’ve done some work to make this better (reorg the projects, restructure the download files). I’ve also tested the example code from each chapter against the latest service packs for all releases since 2.2 Enterprise. That includes making some small updates to get the examples running on 3.2 Community.

You can now download either all of the source for every version I tested against, or, download the source that works for a specific version. It may take the official download site at Packt a while to get the new files, so here are links to download them from my site:

Alfresco Developer Guide example source code for…

  • Alfresco 2.2 Enterprise (~5.3 MB, Download)
  • Alfresco 3.0 Labs (~5.6 MB, Download)
  • Alfresco 3.0 Enterprise (~5.7 MB, Download)
  • Alfresco 3.1 Enterprise (~5.6 MB, Download)
  • Alfresco 3.2 Community (~5.7 MB, Download)
  • Alfresco 3.2 Enterprise (~5.9 MB, Download)
  • All of the above, combined (~28.1 MB, Download)

Hopefully this makes it easier for you to grab only what you need and makes it clear that each Eclipse project contains only what’s needed to work with that version of Alfresco. Deployment is easier too. Most of the time, it’s just the “someco-client-extensions” project that you deploy.

Now that I’ve got everything structured like I want it, as new versions of Alfresco are released, it should be much easier to keep up.

Notes from OSCON 2009 in San Jose

I’m back from San Jose. My colleage, Dave Gynn, and I had fun at the O’Reilly Open Source Conference (OSCON) and learned a lot. Dave’s ability to pick out open source rockstars from a crowd is uncanny. It was pretty sweet seeing Larry Wall (and his family) hanging out and then hearing him speak. Although there are all kinds of topics on all things Open Source, the conference does have a heavy Perl bias.

Dave and I decided we were glad we went but we don’t feel like we have to be there every year going forward. This was my first time, but Dave said the general excitement level seemed low for some reason. Maybe it was Allison Randal’s seriously downbeat welcome address. Not sure. Anyway, here are my rough notes from some of the sessions I attended…

“Open Source in Government” was a big theme at OSCON this year. Speakers tried to instill a sense of urgency in the audience by saying that the window of opportunity for getting the government behind open source in a big way will only be open for a few more months. If you want to get involved, check out some of these links:

Data.gov mash-up contest
http://sunlightlabs.com/contests/appsforamerica2/

Machine readable datasets from the US Govt
http://www.data.gov/

Help the government make better use of open source
http://www.opensourceforamerica.org/

Some folks from Liferay presented on a new UI framework they’ve created called Alloy. Alloy is aimed at providing a single framework that addresses HTML, CSS, and JavaScript in a way that is abstracted from the underlying libraries. Alloy basically extends/subclasses JQuery and YUI. Liferay is migrating a lot of their OOTB portlets now to the new framework. It is expected to ship as part of 5.3. This talk was more about the “why” and less about the “what”. I would have liked to see more examples/demos.

Went to a talk on “using Django for election audits” that turned out to be more about how screwed up our elections process is and the minutiae of performing an audit on election results with not so much on how Django was used to solve the problem. The speaker did give a shout out to the Django Debug Toolbar that might prove to be useful. The presenter is looking for help with the project. He needs everything from UI help to people who can send him election results from their local election boards.

Saw a decent talk on Apache CouchDB. Couch is a schema-less database that is built for massive distributed scalability. Instead of SQL you use map-reduce functions to query. Key to Couch is the concept of “eventual consistency”–in a Couch app, data can be consistent over time instead of right now. Couch always knows either the correct old value or the correct current value, but it may take time to propogate the current value to every node in the system.

Noteworthy bullet points:

  • Couch can idle in 4MB of RAM. With a couple of production databases Couch will use about 20MB.
  • Canonical is including Couch in the Karmic Koala release. This will give apps running on Karmic the ability to easily sync data between nodes. Couch will also be running as part of Ubuntu One which means Karmic desktops can sync data with the Ubuntu cloud (See the Ubuntu wiki).
  • Someone is currently working on a JavaScript implementation of Couch. Among other things, this would give you the ability to replicate your CouchDB to a local version of Couch running in someone’s browser.
  • Current ACL is limited to “you are either an admin or you aren’t”. ACL for writers *might* make it into 1.0. ACL for readers won’t.

I went to the “JRuby on AppEngine” talk not for the JRuby, but because it was the only Google AppEngine session I could find. I was looking for some factoids on who’s using AppEngine. Here’s what they said:

  • 200,000 registered developers
  • 85,000 applications
  • Household names such as: eBay, Best Buy, Forbes, Whitehouse.gov.

Whitehouse.gov was a cool scalability story for AppEngine. They used AppEngine to moderate questions submitted during Obama’s first online town hall. According to the Google Code blog,

“During the 48-hour open voting period, the site peaked at 700 hits per second, and 92,934 people submitted 104,073 questions and cast 3,605,984 votes. In total, over one million unique visitors visited the site before the town hall. Even while the site was featured on major news outlets and even the Google homepage the other 50,000 apps built on App Engine were fully supported and experienced no adverse effects.”

The Erlang talk provided a good history of the language. I would have liked more on the language itself and less of the detailed history behind Ericsson’s telecom switches (even though Erlang played a critical role in those products). I was aware that CouchDB is built with Erlang but the speaker mentioned a couple of other open source projects that leverage Erlang that I hadn’t heard of: ejabberd is an Erlang-based chat server and RabbitMQ is an Erlang-based messaging server.

The “building a business on an open source distributed cloud” talk by Bradford Stephens was good. The speaker’s company, Visible Technologies, mines social networks and the internet in general for consumer sentiment on its customer’s brands. Their system ingests vast subsets of the Internet, parses the results, processes it, and indexes it so that they can run analytics against it for their clients. They moved from an all-Microsoft stack to an open source stack and have been very happy with it.

This was the third “noSQL”-themed talk I saw. He made a good point that when we design apps, we should be saying, “I need persistence” and then figure out what is the best provider of that given scalability and other constraints rather than starting out with “I need a relational database”.

The open source stack used by Visible Technologies includes the usual search players (Lucene, Nutch, Solr) as well as one I haven’t heard of: Katta is used to shard large Lucene indexes across multiple servers. They also use a couple of Hadoop sub-projects, HBase and ZooKeeper, and several others.

The New York Times API and NPR API talks were very good. I didn’t realize how many different API’s NYT has exposed. You can check out their API’s around people, news, search, movies, and books at http://developer.nytimes.com. Their blog is also worth checking out.

Lots of apps have been built using the NYT API. A personal favorite is InstantWatcher. It is a mash-up of NYT’s movies API with Netflix that helps you find good movies available to watch instantly.

NPR’s talk focused less on their specific API and more on how it is being used. Noteworthy bullets:

  • You can build API calls with their query generator (requires a free API key) or by hand (doc).
  • NPR offers tiered key levels. If you create something cool and drive a little traffic their way, you can get your key upgraded to a higher tier.
  • There are no rate limits. NPR believes they have built an infrastructure that can take “anything we can throw at it”.
  • The API has 2,000 users and serves 24 million requests (per ?) averaging 2 million requests per month.
  • 50% of the API requests are for NPRML with less than 0.1% requesting ATOM. NPR API results are also available as JSON, RSS, and several other formats.
  • The NPR Digital Media team blogs at http://www.npr.org/blogs/inside/
  • Interesting side-note: NPR is currently migrating off of Oracle 10g to MySQL

After the NYT and NPR talks, they held a developer meet-up of sorts. Unfortunately I had to head to the airport so I missed out on that.

ECM vendors have their heads in the cloud, can you see through the fog?

The hype around cloud computing has reached a fevered pitch so it is natural that ECM vendors try to take advantage of that as much as they can. Some examples from the open source ECM world:

  • Alfresco always seems to be partnering with one cloud vendor or another. I went to a brief session on Alfresco, GoGrid, and ParaScale earlier this year. (As an aside, those GoGrid cycling socks, which I thought was a strange giveaway at the time, are awesome).
  • At the end of last year eZ Publish announced a partnership with Mamut to provide eZ as SaaS.
  • Just last week Nuxeo announced a cloud edition of its product.

Clearly, ECM vendors are busy figuring out how to take advantage of the cloud. But what does it mean for ECM to be “in the cloud”? When might it work for you?

Cirrus, Stratus, or Cumulonimbus

The first thing you need to realize is that when people say “cloud” they often mean very different things. Generally, there are three types of clouds: Software-as-a-Service (Saas), Platform-as-a-Service (PaaS), and Infrastructure-as-a-Service (IaaS).

Software-as-a-Service (SaaS) is the same model that’s been around for years but has lately taken advantage of the cloud moniker. Google Apps and Salesforce.com are the big SaaS players but there are SaaS offerings for all kinds of business applications, including content management.

The allure of SaaS ECM is the same as that of SaaS in general:

  • Lower up-front costs
  • Someone else gets to worry about running and scaling the infrastructure
  • Depending on the vendor, you may only have to pay for what you use

The challenges of SaaS ECM include things like:

  • The ability to do heavy customization and complex workflows
  • Ease of integration with other systems
  • Client perceptions (and real issues) around data security
  • Data portability/vendor lock-in

Open Source CM vendors Nuxeo and eZ Systems have SaaS offerings as do proprietary vendors such as SpringCM, CrownPeak, Clickability, and PaperThin, to name a few. Beyond just general-purpose document and content management, I think you’ll also see vendors build verticalized SaaS offerings on top of hosted content management technology.

The next type of cloud is Platform-as-a-Service (PaaS). The two best examples of PaaS are Google App Engine (GAE) and Salesforce.com’s force.com platform. With PaaS, you provide the code and the PaaS provider does the rest. Of course this means your code has to follow certain standards and is often subject to limitations, but the beauty is that you get a completely custom solution without worrying about any of the infrastructure.

I like GAE. For certain applications, the benefits of instantaneous, global scale far outweigh the limitations of the platform. But I don’t expect ECM vendors that would do well in SaaS or IaaS clouds to do much with PaaS. You can’t take an Alfresco or a Drupal and run it on a PaaS cloud. I do think we will see PaaS-native content management systems. For example, I’ve seen apps in the Salesforce.com AppExchange that are basically tools for building a web site that’s tightly integrated with Salesforce.com. I think you’ll also see solutions that leverage a PaaS for certain components or sub-systems.

The third type of cloud is Infrastructure-as-a-Service (IaaS). An IaaS cloud is about providing virtual servers on-demand. Examples include things like Amazon’s EC2, Rackspace Cloud, and GoGrid. With these services you can instantly provision as many servers as you need. What you do with them is up to you. When you’re done, you turn them off. Specifics vary but you are essentially billed for CPU time.

The way people leverage IaaS differs. Some people will provision a server and install their ECM software of choice and stop there. Other than dealing with different file storage approaches of various IaaS vendors, this is really no different than running your own virtual servers. So when someone says they are running XYZ CMS “in the cloud” and it turns out to be a single node on a virtual machine, I can barely stifle a yawn. It’s fast and convenient to set up, yes, but technically it’s pretty boring.

The more interesting way to use ECM in an IaaS cloud is to leverage the ability of the infrastructure to scale on-demand. That’s the real value of “the cloud” after all. For example, at Optaros we run an IaaS-hosted solution called OView that syndicates content and content-centric applications to web sites. When a client places that content or app on Yahoo’s home page we get a huge spike in traffic. We run the solution on Amazon EC2 images and we use RightScale to dynamically provision additional nodes when traffic warrants.

The degree to which a specific ECM vendor can operate in a dynamically-scaled infrastructure varies greatly. Simply “running in the cloud” is easy. Scaling your ECM infrastructure automagically is harder.

What do you really need?

If the list of SaaS benefits have a lot of appeal to you and the challenges and potential limitations aren’t much of a bother, SaaS ECM might be worth evaluating. This will most likely be a better fit for clients with limited IT resources and simple to moderate requirements around ECM.

On the IaaS front, if it is just an issue of externally-hosting your ECM infrastructure, make sure the cloud is what you want. The best use case for the cloud is when demand is temporary or unpredictable with huge spikes. I would argue that for your core ECM infrastructure demand is neither temporary nor unpredictable.

If “scale” is your issue, I would challenge you to think about exactly what needs to be scaled. If it is just content delivery of static content, maybe you could get by with a CDN. If your content management system can separate authoring from dynamic delivery of content, maybe only the dynamic content delivery mechanism needs to be able to scale quickly.

You might have certain processes (large-scale video transcoding, for example, or other types of periodic batch processing) that you could leverage the cloud for without cloud-enabling your entire ECM infrastructure. Acquia‘s hosted spam filtering service, Mollum, and their newly-released hosted-search offering are two examples where only specific pieces of your infrastructure are off-loaded to the cloud.

If it turns out that you need to scale the whole ball of wax, fine, it can be done, but have a good reason.

ECM in the cloud is, um, cloudy

The cloud as a style of computing is exciting. The cloud as a “feature” is potentially confusing. ECM vendors are going to do what they can do have it somewhere “on the box”. But it’s not something you can simply check off. The next time you hear an ECM vendor say, “cloud-ready”, ask them what they mean. Then figure out whether or not that has any relevance at all to your real requirements.

Is the cloud on your horizon? Let me know if/how the cloud relates to your ECM strategy.