Month: September 2012

My New Book: CMIS and Apache Chemistry in Action

I’ve been working on another book project along with Jay Brown from IBM and Florian Mueller from SAP. It’s called “CMIS and Apache Chemistry in Action” and we intend for it to be the definitive guide to CMIS. I’m having a great time working with Jay and Florian, both of whom have been heavily involved with the CMIS spec and Apache Chemistry from the beginning.

The book is being published by Manning and should be out in April of 2013. As of today it is available through the Manning Early Access Program (MEAP). We’re excited about having the book on MEAP because it means you can buy the book today and get the chapters as they are written. This gets drafts of the chapters into your hands quickly so you can apply what you learn to your projects immediately, and, even more importantly, it gives you a chance to send us feedback that we can incorporate into the book.

I hope that everyone who wants to write content-centric applications on top of repositories like Alfresco, FileNet, SharePoint, Documentum, and so on, will benefit from the book, whether you are writing those apps in Java, Groovy, Python, PHP, JavaScript, C#, or Objective-C. The book starts out with an intro to CMIS and then moves through a real-world example (a CMIS-based music mash-up application) built step-by-step. Once we’ve sufficiently covered the client side, we move on to the server side for those who need to know how to implement their own CMIS repositories.

The book covers the current 1.0 specification as well as the forthcoming 1.1 version of the specification.

We’ve still got a lot to write, but it feels great to reach the MEAP milestone. I look forward to hearing feedback from all of you as we continue to knock out chapters this winter.

If you want to buy the book (MEAP or print, when it is available), you can use this code to get 37% off: 12cmisal.

Two oldies but goodies: Screencasts showing Alfresco, Drupal, and Open Atrium

Back when I was at Optaros and CMIS was first showing up in Alfresco in draft form, we developed the Drupal CMIS module. We had a few customers interested in combining the two technologies but I think we were a few years ahead of our time. Now it seems I come across people wanting to combine the two nearly every week in IRC, the forums, or in internal discussions. Alfresco has contributed additional modules for Drupal integration. And multiple partners have full-fledged solutions or service offerings built on top of both.

Optaros has changed their web site a bit since those days and a couple of screencasts I recorded back then have been removed. Because so many people are still interested in this topic, I’ve posted them on YouTube, with Optaros’ permission (thanks!).

So, if you want to see some examples of Drupal and Alfresco working together, here are a couple of screencasts from the archive…

Drupal Plus Alfresco (Original post, 4/7/2009)

Open Atrium Plus Alfresco (Original post, 10/13/2009)

Drupal, Open Atrium, the CMIS modules, and Alfresco have all progressed since then, but the general gist is the same.

Thanks to Optaros for letting me make these available!

Tips on Working with Google Fusion Tables

We needed to see Alfresco forum users by geography. Google Fusion Tables can plot geographic locations stored in one or more columns on a map. We had successfully used this before for smaller batches of mostly static data, so I decided to see if it would work well for our forum data. This blog post is about what I did, including some useful tips for working with the Google Fusion Tables API.

Determining the Location

First, I needed a city and country for each forum user. In our forums, users can declare their location, but not everyone does. So I wrote a little Python script that uses the MaxMind GeoLite database to determine a location for each user based on IP address. The script then compares the IP-determined location with the user’s declared location, and if they are different, it asks the person running the script to choose which one is likely to be more accurate. For example, the IP-address-based lookup might come back with “Suriname” while the user’s declared location is “Paramaribo, Suriname”, so you’d choose the latter. The script saves each decision so that it doesn’t have to ask again for the same comparison on this run or subsequent runs.
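To illustrate the "ask once, remember the answer" part, here is a minimal sketch. It is not the actual script: the function names, the JSON file used to persist decisions, and the "1"/"2" prompt are all assumptions made for the example, and the IP lookup itself is left out.

```python
import json
import os


def load_decisions(path):
    """Load previously saved choices, or start empty if the file is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {}


def save_decisions(path, decisions):
    """Persist choices so later runs do not have to ask again."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(decisions, f, indent=2, sort_keys=True)


def resolve_location(ip_location, declared_location, decisions, ask=input):
    """Pick a location for one user, asking a human only for new conflicts.

    `decisions` maps "ip_location|declared_location" to the chosen value.
    `ask` is any callable that takes a prompt and returns the answer text.
    """
    if not declared_location:
        return ip_location
    if not ip_location or ip_location == declared_location:
        return declared_location

    key = ip_location + "|" + declared_location
    if key not in decisions:
        answer = ask(
            "1) %s  2) %s  Which is more accurate? "
            % (ip_location, declared_location)
        )
        # Anything other than "2" keeps the IP-based location.
        if answer.strip() == "2":
            decisions[key] = declared_location
        else:
            decisions[key] = ip_location
    return decisions[key]
```

A run would call load_decisions() once at startup, resolve_location() for each user, and save_decisions() at the end, so the same comparison is only ever answered once.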

Loading the Data into Google Fusion Tables with Python

Once I had a city and country for each forum user I had to get those loaded into a Google Fusion Table. I found this Python-based Fusion Tables client and it worked quite nicely.

Here are a few tips that might save you some time when you are working with Google Fusion Tables, regardless of the client-side language…

Don’t Update–Drop, then Add

I started by trying to be smart about updating existing records rather than inserting new ones. But this meant that for each row, I had to do a query to test for the existence of a match and then do an update. This was incredibly slow, especially because you can’t do bulk updates (see next point).

So every time I run an update, the script first clears out the table. That means I load the entire dataset every time there is an update, but that is much faster than the update-if-present-otherwise-insert approach.

Batch Your Queries

The Google Fusion Tables API supports bulk operations. You can execute up to 500 at a time, if I recall correctly. This is a huge time-saver. My script just adds the insert statements to a list, and when it gets 500 (or runs out of inserts) it joins the list on “;” and then executes the batch with a single call to the Fusion Tables API.

The one drawback, as mentioned in the previous point, is that it does not support bulk updates; only inserts are supported. But with the performance gain of bulk operations, I don’t mind clearing out the table and re-inserting.
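The batching logic is simple enough to sketch in a few lines. This is only an illustration: batch_statements is a hypothetical helper name, the statements are treated as opaque strings, and the call that actually sends each batch to Fusion Tables is left to whatever client you use.

```python
def batch_statements(statements, batch_size=500):
    """Yield ';'-joined batches of at most `batch_size` statements each."""
    batch = []
    for statement in statements:
        batch.append(statement)
        if len(batch) == batch_size:
            yield ";".join(batch)
            batch = []
    # Flush whatever is left once the inserts run out.
    if batch:
        yield ";".join(batch)
```

With 1,200 insert statements this produces three batches (500, 500, and 200), so the table is loaded with three API calls instead of 1,200.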

Throttle Your Requests

If the script exceeds 30 requests per minute, it is highly likely to get rate-limited, so it is important to throttle your requests. I found that a 2.5-second wait between queries was fine, and because the queries are batched 500 at a time, it really isn’t a big deal to wait.
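A 2.5-second gap works out to at most 24 requests per minute, which stays under the limit. One way to enforce that gap is sketched below. The run_throttled name and the execute callable are assumptions for the example; execute stands in for whatever function sends one batch to the API.

```python
import time


def run_throttled(batches, execute, min_interval=2.5,
                  sleep=time.sleep, clock=time.monotonic):
    """Call execute(batch) for each batch, at least `min_interval` seconds apart.

    `sleep` and `clock` are injectable so the pacing can be tested
    without actually waiting.
    """
    results = []
    last_call = None
    for batch in batches:
        if last_call is not None:
            remaining = min_interval - (clock() - last_call)
            if remaining > 0:
                sleep(remaining)
        last_call = clock()
        results.append(execute(batch))
    return results
```

Measuring from the start of the previous call means time spent waiting on the API counts toward the interval, so the script only sleeps for whatever is left of the 2.5 seconds.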

Geocoding Takes Time

So the whole thing is pretty slick but there is a small pain. Because all rows get dropped every time I load the table, every row has to be geocoded and that takes time. I believe there is an API call to ask the table to be geocoded but I haven’t found that to work reliably. Instead, I have to go to the table in my browser and tell Fusion Tables to geocode the table. This takes a LONG time. For a table of about 10,000 rows it could easily take 45 minutes or more. At least it is something I can kick off and let run. I only update the table once a month. If it were more often, it would be an issue.

Voila!

That’s it! Thanks to Python and Google Fusion Tables, I now have an interactive map of forum users. Not only is the map useful to explore interactively, it also lets me run geographic queries against the table from Python, such as “find me the 20 forum users with more than X posts who work within a 20-mile radius of this spot,” which can be handy for doing local community outreach.
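To make the shape of that kind of query concrete, here is a rough local equivalent. It does not use Fusion Tables at all: it filters a plain Python list using the haversine great-circle distance, and it assumes each user record already has "posts", "lat", and "lon" values. The function names and record layout are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def users_near(users, lat, lon, radius_miles=20, min_posts=0, limit=20):
    """Return up to `limit` users with more than `min_posts` posts who are
    within `radius_miles` of (lat, lon), most active first."""
    matches = [
        u for u in users
        if u["posts"] > min_posts
        and distance_miles(lat, lon, u["lat"], u["lon"]) <= radius_miles
    ]
    matches.sort(key=lambda u: u["posts"], reverse=True)
    return matches[:limit]
```

For a table of about 10,000 rows, a linear scan like this is fast enough that no spatial index is needed.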