CMIS: An open API for managing content

Most of the content in a company is completely unstructured. Just think about the documents you collaborate on with the rest of your team throughout the day. They might include things like proposals, architecture diagrams, presentations, invoices, screenshots, videos, books, meeting notes, or pictures from your last company get-together.

How does a company organize all of that content? Often it is scattered across file shares and employee hard drives. It isn’t really organized at all. It’s hard enough to simply find content in that environment, but what about answering questions like:

  • Is this the latest version and how has it changed over time?
  • Which customer is this document related to?
  • Who is allowed to read or make changes to this document?
  • How long are we legally required to keep this document?
  • When I’m done making my change to this document, what is the next step in the process?

To address this, companies will often write content-centric applications that try to bring some order to the chaos. But most of our content resides in files, and files can be a pain to work with. Databases can store files up to a certain size, but they aren’t great for working with audio and video. File systems solve that problem, but on their own they don’t offer rich functionality such as tracking complex metadata with each file or full-text indexing and searching across all of your content.

That’s where a content repository comes in. You might hear these referred to as Document Management (DM) systems or Enterprise Content Management (ECM) systems. No matter what you call them, they are purpose-built to help your company get a handle on its file-based content.

Here’s the problem for developers, though: there is a lot of repository software out there. Most large companies have more than one repository up and running in their organization, and every one of them has its own API. It’s rare that these systems exist in a vacuum. They often need to feed and consume business processes, and that takes code. So if you are an enterprise developer trying to integrate some of your systems with your ECM repositories, you’ve got multiple APIs to learn. Or, if you are a software vendor trying to build a solution that requires a rich content repository as a back-end, you either have to choose a specific back-end to support or write adapters for a handful of repositories.

The solution to this problem is called Content Management Interoperability Services (CMIS). It’s an industry-wide specification managed by OASIS. It describes a domain language, a query language, and multiple protocols for working with a content repository. With CMIS, developers write against the CMIS API instead of learning each repository’s proprietary API, and their applications will work with any CMIS-compliant repository.

The first version of the specification became official in May of 2010. The most recent version, 1.1, became official in May of 2013.

Several developers have been busy writing client libraries, server-side libraries, and tools related to CMIS. Many of these are collected as part of an umbrella open source project known as Apache Chemistry (http://chemistry.apache.org). The most active Apache Chemistry sub-project is OpenCMIS. It includes a Java client library (including Android), multiple servers for testing purposes, and some developer tools, such as a Java Swing-based repository browser called OpenCMIS Workbench. Apache Chemistry also includes libraries for Python, .NET, PHP, and Objective-C.

The tools and libraries at Apache Chemistry are a great way to get started with CMIS. For example, I’ve got the Apache Chemistry InMemory Repository deployed to a local Tomcat server. I can fire up OpenCMIS Workbench and connect to the server using its service URL, http://localhost:8080/chemistry/browser. Once I do that, I can navigate the repository’s folder hierarchy, inspecting or performing actions against objects along the way.

The OpenCMIS Workbench has a built-in Groovy console. One of the examples that ships with the Workbench is “Execute a Query”. Here’s what it looks like without the imports:

String cql = "SELECT cmis:objectId, cmis:name, cmis:contentStreamLength FROM cmis:document"

ItemIterable<QueryResult> results = session.query(cql, false)

results.each { hit ->
    hit.properties.each { println "${it.queryName}: ${it.firstValue}" }
    println "--------------------------------------"
}

println "--------------------------------------"
println "Total number: ${results.totalNumItems}"
println "Has more: ${results.hasMoreItems}"
println "--------------------------------------"

The Apache Chemistry OpenCMIS InMemory Repository ships with some sample data, so when I execute the Groovy script I see something like:

cmis:contentStreamLength: 33216
cmis:name: My_Document-0-1
cmis:objectId: 134
--------------------------------------
cmis:contentStreamLength: 33226
cmis:name: My_Document-1-0
cmis:objectId: 130
--------------------------------------
cmis:contentStreamLength: 33718
cmis:name: My_Document-2-0
cmis:objectId: 105
--------------------------------------
cmis:contentStreamLength: 33617
cmis:name: My_Document-2-1
cmis:objectId: 122
--------------------------------------
cmis:contentStreamLength: 33807
cmis:name: My_Document-2-2
cmis:objectId: 129
--------------------------------------
cmis:contentStreamLength: 33364
cmis:name: My_Document-2-1
cmis:objectId: 128
--------------------------------------
cmis:contentStreamLength: 33506
cmis:name: My_Document-2-1
cmis:objectId: 112
--------------------------------------
cmis:contentStreamLength: 33567
cmis:name: My_Document-2-1
cmis:objectId: 106
--------------------------------------
cmis:contentStreamLength: 33230
cmis:name: My_Document-2-2
cmis:objectId: 107
--------------------------------------
cmis:contentStreamLength: 33774
cmis:name: My_Document-1-1
cmis:objectId: 115
--------------------------------------
cmis:contentStreamLength: 33524
cmis:name: My_Document-2-0
cmis:objectId: 121
--------------------------------------
cmis:contentStreamLength: 33593
cmis:name: My_Document-2-0
cmis:objectId: 111
--------------------------------------
cmis:contentStreamLength: 34152
cmis:name: My_Document-2-2
cmis:objectId: 123
--------------------------------------
cmis:contentStreamLength: 33332
cmis:name: My_Document-0-0
cmis:objectId: 133
--------------------------------------
cmis:contentStreamLength: 33478
cmis:name: My_Document-1-2
cmis:objectId: 116
--------------------------------------
cmis:contentStreamLength: 33541
cmis:name: My_Document-1-2
cmis:objectId: 132
--------------------------------------
cmis:contentStreamLength: 33225
cmis:name: My_Document-2-0
cmis:objectId: 127
--------------------------------------
cmis:contentStreamLength: 33333
cmis:name: My_Document-2-2
cmis:objectId: 113
--------------------------------------
cmis:contentStreamLength: 33698
cmis:name: My_Document-1-0
cmis:objectId: 114
--------------------------------------
cmis:contentStreamLength: 33746
cmis:name: My_Document-0-2
cmis:objectId: 135
--------------------------------------
cmis:contentStreamLength: 33455
cmis:name: My_Document-1-1
cmis:objectId: 131
--------------------------------------
--------------------------------------
Total number: 21
Has more: false
--------------------------------------

So that query returned three properties (cmis:objectId, cmis:name, and cmis:contentStreamLength) for every object in the repository of type cmis:document. We could have restricted the query further with a WHERE clause that tested specific property values or even the full-text content of the files.
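
As a rough illustration of that last point, the sketch below assembles a query whose WHERE clause filters on cmis:name with LIKE and, optionally, on full-text content through the CMIS CONTAINS() predicate. The class and method names (CmisQuerySketch, quote, documentsNamedLike) are hypothetical and not part of OpenCMIS. The name pattern is only a guess based on the sample output above, and full-text search works only if the repository supports it. The escaping covers plain string literals only, so treat this as a starting point rather than a complete implementation.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of OpenCMIS) that assembles a CMIS query string. */
final class CmisQuerySketch {

    /** Wraps a value in single quotes, backslash-escaping backslashes and single quotes. */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /**
     * Builds a query over cmis:document that filters on cmis:name with LIKE.
     * If fullText is non-null, a CONTAINS() full-text condition is added.
     * Assumption: fullText is a simple word; richer search expressions need more escaping.
     */
    static String documentsNamedLike(List<String> columns, String namePattern, String fullText) {
        StringBuilder cql = new StringBuilder("SELECT ")
                .append(String.join(", ", columns))
                .append(" FROM cmis:document WHERE cmis:name LIKE ")
                .append(quote(namePattern));
        if (fullText != null) {
            cql.append(" AND CONTAINS(").append(quote(fullText)).append(")");
        }
        return cql.toString();
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("cmis:objectId", "cmis:name");
        // Property filter only: % is the LIKE wildcard.
        System.out.println(documentsNamedLike(columns, "My_Document-2-%", null));
        // Property filter plus full-text condition ("proposal" is a made-up search term).
        System.out.println(documentsNamedLike(columns, "My_Document-%", "proposal"));
    }
}
```

The resulting string can be passed to session.query(cql, false) in the same way as the statement in the Groovy script.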

Now I also happen to be running Alfresco, which is an open source ECM repository. The beauty of CMIS is that I can run that exact same Groovy script against Alfresco. I simply have to reconnect using Alfresco’s service URL, which is http://localhost:8080/alfresco/cmisatom (for the AtomPub binding). My local Alfresco repository has many more objects than my OpenCMIS InMemory repository, so I won’t list the output here, but the code runs successfully unchanged.
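
To make the "only the connection changes" point concrete, here is a minimal sketch of the two sets of connection settings side by side. It uses plain maps and needs no CMIS library. The property keys are the string values I believe OpenCMIS uses for its session parameters, so please verify them against the version you run. The class name ConnectionSettingsSketch and its methods are hypothetical, and credentials are left out on purpose.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration: the query script stays the same, only these settings differ. */
final class ConnectionSettingsSketch {

    // Assumed OpenCMIS session parameter keys; check them against your OpenCMIS version.
    static final String BINDING_TYPE = "org.apache.chemistry.opencmis.binding.spi.type";
    static final String BROWSER_URL = "org.apache.chemistry.opencmis.binding.browser.url";
    static final String ATOMPUB_URL = "org.apache.chemistry.opencmis.binding.atompub.url";

    /** Settings for the InMemory repository, reached through the Browser binding. */
    static Map<String, String> inMemory() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put(BINDING_TYPE, "browser");
        params.put(BROWSER_URL, "http://localhost:8080/chemistry/browser");
        return params;
    }

    /** Settings for Alfresco, reached through the AtomPub binding. */
    static Map<String, String> alfresco() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put(BINDING_TYPE, "atompub");
        params.put(ATOMPUB_URL, "http://localhost:8080/alfresco/cmisatom");
        // A real Alfresco connection would also need user and password entries.
        return params;
    }

    public static void main(String[] args) {
        System.out.println("InMemory: " + inMemory());
        System.out.println("Alfresco: " + alfresco());
    }
}
```

With the OpenCMIS client library on the classpath, a map like this is what you would hand to the session factory to obtain a session. In the Workbench, the login dialog does that for you.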

Readers who spend their days enjoying standardized SQL that works across databases or the benefits of ORM tools that abstract their code from any specific relational database will no doubt be unimpressed by this feat. But I promise you that those of us who have to work with ECM repositories like SharePoint, Documentum, FileNet, and Alfresco, sometimes all on the same project, are rejoicing.

The next time you need to integrate with an ECM repository, CMIS should definitely be on your radar. I’ve created a list of CMIS resources to help you.

6 comments

  1. Victor says:

    Hello Jeff,

    I’m trying to migrate from a connector based on the JSR-170 spec to a CMIS connector. The problem I have is that CMIS doesn’t map the type “d:content”. That’s a problem for me because there is a lot of data in two properties with that type.
    What’s the solution in CMIS? Maybe I have to use renditions from now on, but how can I migrate all the data in those properties if I can’t access those properties from CMIS?

    Thanks!
