Sometimes I feel like I spend as much time telling people why they don’t need an Enterprise Content Management (ECM) platform as I do helping people implement one. Today’s blog post by my friends over at TSG underscores another use case where a full-blown ECM platform may be overkill: Serving as a repository for huge volumes of files.
Their blog post says they are loading 20,000 documents per second into DynamoDB with a goal of getting up to 11 billion documents. The actual files are stored in S3, and that load rate does not include uploading files to S3 buckets, so this doesn’t exactly mimic a real-world bulk document import scenario, but that’s not what TSG was trying to test.
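To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes the 20,000 documents per second rate is sustained around the clock with no pauses, which is my simplification, not a claim from TSG's post. The function name is just for illustration.

```python
# Rough estimate of how long it takes to reach the target document count,
# ASSUMING the quoted load rate is sustained continuously (an assumption).
DOCS_PER_SECOND = 20_000        # load rate quoted in the TSG post
TARGET_DOCS = 11_000_000_000    # stated goal of 11 billion documents
SECONDS_PER_DAY = 86_400


def load_time_seconds(total_docs: int, docs_per_second: int) -> float:
    """Return the number of seconds needed to load total_docs at a steady rate."""
    return total_docs / docs_per_second


seconds = load_time_seconds(TARGET_DOCS, DOCS_PER_SECOND)
days = seconds / SECONDS_PER_DAY
print(f"{seconds:,.0f} seconds, or about {days:.1f} days")
# prints "550,000 seconds, or about 6.4 days"
```

In other words, at that rate the metadata for 11 billion documents could be loaded in under a week, not counting the time to upload the files themselves.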
TSG correctly points out that it is the metadata repository, which legacy vendors often base on relational databases, that struggles in high-volume implementations, so their test focuses on the ability of DynamoDB to store a high volume of data while maintaining performance.
Let me shift slightly from case management, which is what TSG focuses on, to the more general problem firms have when they generate hundreds of millions of files that they need to manage and deliver to their stakeholders.
I have seen multiple clients and prospects who have very demanding requirements in terms of data volume while at the same time requiring very little in terms of what I’d call traditional ECM functionality. Often, they need nothing more than a RESTful API to get data into and out of the repository and some basic searching across a few metadata fields. Insurance companies and financial services companies are two examples of industries with lots of such use cases.
TSG is using native AWS services to manage metadata (DynamoDB) and file storage (S3), but this can also be done on-premises leveraging either commercial or open source solutions to provide a nearly-infinite scale, highly-performant, fully redundant solution.
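To make the idea concrete, here is a minimal sketch of that separation between file storage and metadata storage. It is only an illustration: the `MinimalRepository` class, its method names, and the sample metadata fields are hypothetical, and plain in-memory dictionaries stand in for the object store (S3 or an on-premises equivalent) and the metadata store (DynamoDB or similar). A real system would put a REST layer in front of operations like these.

```python
import uuid
from typing import Dict, List, Optional


class MinimalRepository:
    """Hypothetical in-memory sketch: files and metadata live in separate stores."""

    def __init__(self) -> None:
        # Stand-in for object storage: document id -> file bytes.
        self._blobs: Dict[str, bytes] = {}
        # Stand-in for the metadata store: document id -> metadata fields.
        self._metadata: Dict[str, Dict[str, str]] = {}

    def create(self, content: bytes, metadata: Dict[str, str]) -> str:
        """Store the file and its metadata, returning a new document id."""
        doc_id = str(uuid.uuid4())
        self._blobs[doc_id] = content
        self._metadata[doc_id] = dict(metadata)
        return doc_id

    def read(self, doc_id: str) -> Optional[bytes]:
        """Return the file bytes, or None if the document does not exist."""
        return self._blobs.get(doc_id)

    def update_metadata(self, doc_id: str, changes: Dict[str, str]) -> bool:
        """Merge changes into the document's metadata; False if not found."""
        if doc_id not in self._metadata:
            return False
        self._metadata[doc_id].update(changes)
        return True

    def delete(self, doc_id: str) -> bool:
        """Remove the file and its metadata; False if not found."""
        if doc_id not in self._metadata:
            return False
        del self._metadata[doc_id]
        del self._blobs[doc_id]
        return True

    def search(self, **criteria: str) -> List[str]:
        """Return ids of documents whose metadata matches every criterion exactly."""
        return [
            doc_id
            for doc_id, fields in self._metadata.items()
            if all(fields.get(key) == value for key, value in criteria.items())
        ]


# Usage: store a customer statement, then find it by metadata.
repo = MinimalRepository()
doc_id = repo.create(b"%PDF statement bytes", {"customer": "C-1001", "doc_type": "statement"})
matches = repo.search(doc_type="statement", customer="C-1001")
print(matches == [doc_id])
```

The point of the sketch is how small the surface area is: create, read, update, delete, and an exact-match search across a few metadata fields. Scaling it is then a question of which object store and which metadata store sit behind those two dictionaries.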
The important part here is that you do not need an ECM platform to do this. In fact, many high-profile customers of legacy ECM vendors like Documentum, Alfresco, and FileNet are actively moving away from those platforms for use cases like these.
Why? Companies tell me those platforms are proving too difficult to scale, too complex to implement and maintain, and too expensive in terms of licensing cost.
My company, Metaversant, can build you a minimal content management system in your own data center or in AWS. We call our solution Magritte. It includes:
- Scalable and distributed object storage built on commodity disk drives
- Flexible content model
- REST API for CRUD functions
- Basic permissions on objects
- Metadata search (and full-text search if you need it)
- Ability to manage billions of objects
- $0 in mandatory licensing costs (commercial support of underlying open source components is available at the customer’s option)
That’s enough functionality for many use cases. In fact it seems like the exact right amount of functionality for managing things like customer statements, contracts, and agreements in large insurance or financial services companies.
What legacy ECM platforms include–to be sure, for a cost, in terms of both real dollars and complexity–are things like:
- Support for not just object storage but also additional storage types (Glacier, NAS, SAN, EMC Centera)
- Extensible, formal (schema-based) content model
- REST API for every aspect of the platform
- Foundational/native APIs, client libraries, or SDKs
- Complex, fine-grained access control lists, including ability to support inheritance and “deny”
- Extensible platform (hooks where developers can add code to alter or enhance the platform functionality)
- Support for additional file protocols such as FTP, SMB, IMAP, SMTP, WebDAV
- Support for content repository standards such as JCR and CMIS
- Transformation engine (generates previews and thumbnails)
- Workflow engine
- Rules engine
- Analytics, reporting, & dashboards
- Integrations with third-party systems such as Outlook, SAP, Salesforce, Google Docs, Box, Dropbox
- Web-based user interface
- Application components/framework for extending the web UI or for building new custom web UIs
- Forms engine
- Mobile applications
- Desktop Sync
- Business-specific applications (Records Management, Media Management, Reporting)
Can those traditional ECM features be added to a minimal content management solution like Magritte? Of course. And if you add enough of them you might be better off with a legacy ECM vendor (assuming you can get it to scale to meet your needs, which may be a big assumption depending on your operational constraints).
But if you don’t need a lot or any of those additional features, why start out implementing and paying for an entire aircraft carrier if what you really need is a speedboat?
In software there’s a phrase, “You Ain’t Gonna Need It”, which aims to defer development of features until they are actually needed instead of developing them now for some future need, which may never materialize. In ECM, you might take on the complexity of an entire platform because the “E” in “Enterprise” makes you think you are implementing something the entire company will leverage. That hasn’t panned out–just look at how many companies have multiple so-called “Enterprise” Content Management systems.
Instead of continuing to install these giant platforms, most of which go unused, let’s implement right-sized solutions on top of clusterable, scalable, open source components that talk to each other via APIs. ECM: You Ain’t Gonna Need It.
(Updated 6/3/2020 to fix a minor typo)
Hi Jeff, it’s an interesting article that made me think about the issues. I am not sure your argument is much different from the “build vs buy” conversation that has been around for many years. Perhaps the discussion is now more an “AWS vs buy” conversation?
It’s relatively easy to create something using various AWS services these days, but to quote Alan Pelz-Sharpe, “Great – you can hold a billion documents in the repository, what can you do with those documents?”
Anyone considering a “content repo” needs to consider the long term. What would such a system look like in 5, 10, or 20 years’ time? Perhaps it would have many of the features you listed under “legacy platforms”? What will be the total cost of ownership? For a custom solution, who will maintain it in 10 years’ time? I have plenty of experience where developers leave and no one wants to touch a 10-year-old system that no one understands. Perhaps, instead of paying licence fees for maintenance/bug fixes/new features/security patches, customers can pay consultancies on long 10 – 15 year contracts to maintain their custom ECM solutions? Personally, I prefer open solutions, adopting a broad range of technologies with broad standards.
A recent tweet said “Creating things is relatively easy. Maintaining things is much harder. Deprecating things is even more challenging. And completely retiring something or shutting it down is even harder.” https://twitter.com/davidbrunelle/status/1125418204304560131
A “content repo” is not like your typical JavaScript framework. ECM is here today, and it will be around for many years after we finish our careers. When a company invests in a content repository, they are in it for the long term – creating an initial MVP is the easy bit.
Hi Gethin! Thanks for reading the post and for your comment.
I agree that there is always a build versus buy decision that customers have to make. My point is that if your requirements don’t need all of the features of a full-blown ECM platform, it makes no sense to buy one. So in a build versus buy analysis, customers have to be clear about what it is they are building versus buying.
I would argue that a custom, minimal “repository” assembled from open source components and some glue code will be much easier and less costly to maintain over 10 years than a full-blown ECM platform. An ECM platform is complex. It requires niche resources with deep experience. Those resources are much more expensive than those required to maintain a vastly less complex, albeit custom, code base.
I noticed you didn’t address the scalability concern. In addition to the custom solution being cheaper to maintain, my blog post also asserts that it will be vastly more scalable than traditional solutions. I would add that not only is it likely to be more scalable, but that the scale comes at a much lower cost as compared to much larger, much more complex systems such as those found in a full-blown ECM platform.
You are correct that ECM will be around for a long time. And I have many clients who are happy with the value their ECM platform is providing. I just think those platforms have grown in size, complexity, and cost to a point which makes smaller, more focused solutions a compelling alternative for some use cases.
Hello Jeff,
Would you call a solution based on cherry-picking in Alfresco Community a minimal repository assembled from open source components?
With regards to scalability, I assume you consider the relational nature of Alfresco metadata storage the potential bottleneck? At what scale will that actually start to matter? Gab’s 1B document presentation from 2015 suggests that the relational model can get you very far. I’d be very interested in hearing about experiences from people hitting a brick wall.
Hi Jeff, you said I didn’t address “the scalability concern” and you also “asserts that [your solution] will be vastly more scalable than traditional solutions”. What do you base this on?
A comprehensive benchmark would need to move from “can AWS ingest x billion documents” to testing real, useful features. For example, what is the performance of full-text search on those documents? How do conversions scale with concurrency and volume? If you want to bulk update the metadata on a million documents, how do you do that?
What I am saying is, Nuxeo has completed many demanding benchmarks for potential customers. These are real and comprehensive benchmarks that customers have asked for before committing to the product.
I have heard DynamoDB is quite rigid and hard to tune. Perhaps it would be better to build on open source components like MongoDB, Elasticsearch, and Kafka? You could build that solution yourself or just use Nuxeo, which supports these out of the box.
My assertion is based on what I’ve seen at clients who have tried to scale traditional ECM platforms as well as conversations with others who have done the same.
I also agree that benchmarks and tests are valuable ways to validate that specific functionality like the examples you provided are able to scale to meet requirements. I’m glad that Nuxeo is continuing to run these sorts of benchmarks for their customers.
I like Nuxeo. But the point is that there are some use cases where something like Nuxeo would be overkill. Can we agree that there are scenarios where something custom built on MongoDB, Elasticsearch, and Kafka, for example, would be preferable to a full ECM platform?
No, Alfresco Community would not be the right component for something like this. It’s too coarse-grained and does not support clustering.