ECM: You Ain’t Gonna Need It

Sometimes I feel like I spend as much time telling people why they don’t need an Enterprise Content Management (ECM) platform as I do helping people implement one. Today’s blog post by my friends over at TSG underscores another use case where a full-blown ECM platform may be overkill: serving as a repository for huge volumes of files.

Their blog post says they are loading 20,000 documents per second into DynamoDB with a goal of getting up to 11 billion documents. The actual files are stored in S3, and that load rate does not include uploading files to S3 buckets, so this doesn’t exactly mimic a real-world bulk document import scenario, but that’s not what TSG was trying to test.

TSG correctly points out that it is the metadata repository, which legacy vendors often base on relational databases, that struggles in high-volume implementations, so their test focuses on DynamoDB’s ability to store a high volume of data while maintaining performance.

Let me shift slightly from case management, which is what TSG focuses on, to the more general problem firms face when they generate hundreds of millions of files that they need to manage and deliver to their stakeholders.

I have seen multiple clients and prospects who have very demanding requirements in terms of data volume while at the same time requiring very little in terms of what I’d call traditional ECM functionality. Often, they need nothing more than a RESTful API to get data into and out of the repository and some basic searching across a few metadata fields. Insurance companies and financial services companies are two examples of industries with lots of such use cases.

TSG is using native AWS services to manage metadata (DynamoDB) and file storage (S3), but the same pattern works on-premises, leveraging either commercial or open source solutions to provide a massively scalable, highly performant, fully redundant solution.
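As a rough sketch of that pattern using the AWS SDK for JavaScript (the bucket and table names here are hypothetical, and error handling is omitted), storing a document is essentially two writes:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const db = new AWS.DynamoDB.DocumentClient();

// Store the file itself in S3, then its searchable metadata in DynamoDB.
function storeDocument(docId, fileBuffer, metadata) {
  return s3.putObject({ Bucket: 'doc-store', Key: docId, Body: fileBuffer }).promise()
    .then(() => db.put({
      TableName: 'doc-metadata',
      Item: Object.assign({ docId: docId }, metadata),
    }).promise());
}

Retrieval is just the reverse: look up the metadata item by its ID, then fetch the object from S3.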

The important part here is that you do not need an ECM platform to do this. In fact, many high-profile customers of legacy ECM vendors like Documentum, Alfresco, and FileNet are actively moving away from those platforms for use cases like these.

Why? Companies tell me those platforms are proving too difficult to scale, too complex to implement and maintain, and too expensive in terms of licensing cost.

My company, Metaversant, can build you a minimal content management system in your own data center or in AWS. We call our solution Magritte. It includes:

  • Scalable and distributed object storage built on commodity disk drives
  • Flexible content model
  • REST API for CRUD functions
  • Basic permissions on objects
  • Metadata search (and full-text search if you need it)
  • Ability to manage billions of objects
  • $0 in mandatory licensing costs (commercial support of underlying open source components is available at the customer’s option)

That’s enough functionality for many use cases. In fact, it seems like exactly the right amount of functionality for managing things like customer statements, contracts, and agreements at large insurance or financial services companies.

What legacy ECM platforms include, at a cost in terms of both real dollars and complexity, are things like:

  • Support not just for object storage but also for additional storage types (Glacier, NAS, SAN, EMC Centera)
  • Extensible, formal (schema-based) content model
  • REST API for every aspect of the platform
  • Foundational/native APIs, client libraries, or SDKs
  • Complex, fine-grained access control lists, including ability to support inheritance and “deny”
  • Extensible platform (hooks where developers can add code to alter or enhance the platform functionality)
  • Support for additional access protocols such as FTP, SMB, IMAP, SMTP, and WebDAV
  • Support for content repository standards such as JCR and CMIS
  • Transformation engine (generates previews and thumbnails)
  • Workflow engine
  • Rules engine
  • Analytics, reporting, & dashboards
  • Integrations with third-party systems such as Outlook, SAP, Salesforce, Google Docs, Box, Dropbox
  • Web-based user interface
  • Application components/framework for extending the web UI or for building new custom web UIs
  • Forms engine
  • Mobile applications
  • Desktop Sync
  • Business-specific applications (Records Management, Media Management, Reporting)

Can those traditional ECM features be added to a minimal content management solution like Magritte? Of course. And if you add enough of them you might be better off with a legacy ECM vendor (assuming you can get it to scale to meet your needs, which may be a big assumption depending on your operational constraints).

But if you need few or none of those additional features, why start out implementing and paying for an entire aircraft carrier when all you really need is a speedboat?

In software there’s a phrase, “You Ain’t Gonna Need It”, which says to defer building features until they are actually needed instead of building them now for some future need that may never materialize. In ECM, you might take on the complexity of an entire platform because the “E” in “Enterprise” makes you think you are implementing something the entire company will leverage. That hasn’t panned out; just look at how many companies have multiple so-called “Enterprise” Content Management systems.

Instead of continuing to install these giant platforms, most of whose functionality goes unused, let’s implement right-sized solutions on top of clusterable, scalable, open source components that talk to each other via APIs. ECM: You Ain’t Gonna Need It.

(Updated 6/3/2020 to fix a minor typo)

Have you tried the serverless framework?

Last year I was working on a POC whose target stack was to be as close to 100% native AWS as possible. That’s when I came across Serverless. Back then it was still in beta, but I was really happy with it. After the POC was over, I moved on to other things. A couple of days ago I was reminded how useful the framework is, so I thought I’d share some of those thoughts here.

Before I continue, a few words about the term “serverless”. In short, it gets some folks riled up. I don’t want to debate whether or not it’s a useful term. What I like about the concept is that, as a developer, I can focus on my implementation details without worrying as much about the infrastructure the code runs on. In a “serverless” setup, my implementation is broken down into discrete functions that get instantiated and executed when invoked. Of course there are servers somewhere, but I don’t have to give them a moment’s thought (nor do I have to pay to keep them running, at least not directly).

If your infrastructure provider of choice is AWS, functions run as part of a service offering called Lambda. If you want to expose those functions as RESTful endpoints, you can use the AWS API Gateway. Of course, your Lambda functions can make calls to other AWS services such as DynamoDB, S3, Simple Queue Service, and so on. For my POC, I leveraged all of those. And that’s where the serverless framework really comes in handy.

Anyone who has done anything with AWS knows it can often take a lot of clicks to get everything set up correctly. The serverless framework makes that easier by letting me declare my service, the functions that make up that service, and the resources those functions leverage, all in an easy-to-edit YAML file. Once that configuration is done, you just tell serverless to deploy it, and it takes care of the rest.
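A minimal serverless.yml for the example below looks something like this (the service name is arbitrary, and the runtime should be whichever Node.js version AWS currently supports):

service: echo-service

provider:
  name: aws
  runtime: nodejs6.10  # or whichever Node.js runtime AWS currently offers
  region: us-east-1

functions:
  hello:
    handler: handler.hello  # file handler.js, exported function "hello"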

Let’s say you want to create a simple service that returns some JSON. Serverless supports multiple languages including JavaScript, Python, and Java, but for now I’ll do a JavaScript example.

First, I’ll bootstrap the project:

serverless create --template aws-nodejs --path echo-service

The serverless framework creates a serverless.yml file and a sample function in handler.js that echoes back a lot of information about the request. It’s ready to deploy as-is. So, to test it out, I’ll deploy it with:

serverless deploy -v

Behind the scenes, the framework creates a CloudFormation template and makes the AWS calls necessary to set everything up on the AWS side. This requires your AWS credentials to be configured, but that’s a one-time thing.
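If you haven’t configured credentials yet, one way to do it is with the framework’s own command (the key and secret here are placeholders, of course):

serverless config credentials --provider aws --key <ACCESS_KEY> --secret <SECRET_KEY>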

When the serverless framework is done deploying the service and its functions, I can invoke the sample function with:

serverless invoke -f hello -l

Which returns:

{
    "statusCode": 200,
    "body": "{\"message\":\"Go Serverless v1.0! Your function executed successfully!\",\"input\":{}}"
}
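For reference, that response comes from the generated handler.js, which looks roughly like this (paraphrased from the aws-nodejs template; exact contents vary by framework version):

'use strict';

module.exports.hello = (event, context, callback) => {
  const response = {
    statusCode: 200,
    body: JSON.stringify({
      message: 'Go Serverless v1.0! Your function executed successfully!',
      input: event,
    }),
  };

  // callback(error, result) tells Lambda the invocation is complete
  callback(null, response);
};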

To invoke that function via a RESTful endpoint, I’ll edit the serverless.yml file and add an http event, like this:

functions:
  hello:
    handler: handler.hello
    events:
      - http:
          path: hello
          method: get

And then re-deploy:

serverless deploy -v

Now the function can be hit via curl:

curl https://someid999.execute-api.us-east-1.amazonaws.com/dev/hello

In this case, I showed an HTTP event triggering the function, but other events can trigger functions too, such as an object being uploaded to S3, a message being posted to an SNS topic, or a schedule firing. See the docs for a complete list.
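As a quick sketch, here is roughly what a few of those event sources look like in serverless.yml (the bucket name, topic name, and rate are made up for illustration):

functions:
  process:
    handler: handler.process
    events:
      - s3: uploads-bucket          # fires when an object lands in this bucket
      - sns: notify-topic           # fires when a message is posted to this topic
      - schedule: rate(10 minutes)  # fires on a fixed schedule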

To add additional functions, just edit handler.js and add a new function, then edit serverless.yml to update the list of functions.
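For example, to add a hypothetical goodbye function (the name is mine, not part of the template), append this to handler.js:

module.exports.goodbye = (event, context, callback) => {
  // A trivial second function that returns a static JSON message
  callback(null, {
    statusCode: 200,
    body: JSON.stringify({ message: 'Goodbye!' }),
  });
};

Then register it under functions: in serverless.yml and run serverless deploy again:

  goodbye:
    handler: handler.goodbye
    events:
      - http:
          path: goodbye
          method: get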

Lambda functions cost nothing unless they are executed, and AWS offers a generous free tier. Beyond the first million requests in a month, it costs $0.20 per million requests (pricing).
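For example, setting aside the separate duration charge (which is billed per GB-second of execution time), 10 million requests in a month would cost about $1.80 for requests: the first million are free, and the remaining nine million are billed at $0.20 per million.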

I should also mention that if AWS is not your preferred provider, serverless also works with Azure Functions, IBM OpenWhisk, and Google Cloud Functions.

Regardless of where you want to run it, if you’ve got 15 minutes you should definitely take a look at Serverless.