Tag: Cloud

Storj.io: An open source, massively distributed object store and API for developers

I’ve been playing with a new object storage solution that’s kind of cool. It’s called Storj. Before I describe how it works, let me start by comparing it to a more familiar solution.

Probably the best-known example of object storage is Amazon S3. It allows you to define buckets and then upload files into those buckets. Amazon charges based on the amount you store and the amount you transfer, plus a little based on the total number of objects stored. There are three tiers of storage based on frequency of access, and pricing varies by region, but for discussion purposes let’s call it about $0.023 per GB per month. At that rate, storing 500 GB costs about $138 per year before transfer fees.

For that $138 you can be sure that Amazon is replicating your data across multiple facilities and devices. Amazon says that S3 offers 99.999999999% durability. That’s pretty impressive.

But one consideration with using S3 or any other traditional cloud storage solution is that your data is sitting in data centers owned by a single vendor. Of course you could take steps to replicate that data to other providers, but that is kind of a pain. Even then you will still end up with your data sitting behind a relatively small number of vendors, none of whom are really geared toward transparency and openness.

Storj.io was built to address this problem. It’s an open source, distributed object storage platform. Like S3, the model consists of buckets and files in those buckets. The difference is in how your data is stored. When you upload a file to Storj, your file is broken into small pieces called shards, encrypted using keys you hold, and then uploaded to several nodes around the world.
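
To make that model concrete, here is a minimal conceptual sketch in Java of the client-side behavior described above: split a file into fixed-size shards and encrypt each shard with a key that never leaves your machine. This is not the Storj client library (which is NodeJS); the shard size, class name, and cipher choice are illustrative only.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Conceptual sketch only, not the Storj client. It illustrates the model:
// the file is cut into shards and encrypted locally, so the farmers who end
// up holding the shards never see usable plaintext.
public class ShardAndEncrypt {

    private static final int SHARD_SIZE = 2 * 1024 * 1024; // 2 MB per shard (arbitrary)

    public static void main(String[] args) throws Exception {
        byte[] file = Files.readAllBytes(Paths.get(args[0]));

        // The encryption key is generated and kept client-side.
        SecretKey key = KeyGenerator.getInstance("AES").generateKey();
        SecureRandom random = new SecureRandom();

        List<byte[]> encryptedShards = new ArrayList<>();
        for (int offset = 0; offset < file.length; offset += SHARD_SIZE) {
            byte[] shard = Arrays.copyOfRange(file, offset,
                    Math.min(offset + SHARD_SIZE, file.length));

            // Encrypt each shard with a fresh IV and prepend the IV to the ciphertext.
            byte[] iv = new byte[16];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ciphertext = cipher.doFinal(shard);

            byte[] record = new byte[iv.length + ciphertext.length];
            System.arraycopy(iv, 0, record, 0, iv.length);
            System.arraycopy(ciphertext, 0, record, iv.length, ciphertext.length);
            encryptedShards.add(record);
        }

        // In the real network, each encrypted shard is uploaded to multiple
        // independent farmer nodes; only the key holder can reassemble the file.
        System.out.println("Produced " + encryptedShards.size() + " encrypted shards");
    }
}
```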

Here’s where it gets really interesting. The nodes that store data are not owned by a single entity. Instead, nodes are run by “storage farmers”. Disk farming is kind of like cryptocurrency mining, but instead of solving mathematical computations to earn coins, farmers receive micropayments based on how much of their space gets utilized. Storj actually leverages the Ethereum blockchain to make this work, and if you are interested in the nitty-gritty details, you should check out the whitepaper.

A farmer might be an individual with 50 GB of spare disk space, or it could be an organization with lots and lots of space. You don’t know, and you don’t really care. Their space gets selected based on a number of factors, including node stability, bandwidth, and total space available. If a farmer tries to tamper with any data on their node, they get dropped and they don’t get paid.

Right now Storj is offering 25 GB of free space for one year. After that, the current pricing is $0.015 per GB per month. Using my 500 GB example, that’s $90 per year before transfer fees. And if you have some extra storage sitting around, you could become a farmer and offset your costs a bit.
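
The arithmetic behind those two annual figures is just the per-GB monthly rate times twelve. A trivial sketch, using the rough rates quoted above and ignoring transfer fees and S3’s per-object charges:

```java
// Back-of-the-envelope storage-cost comparison using the rates quoted above.
public class StorageCostComparison {
    public static void main(String[] args) {
        double gigabytes = 500;
        double s3PerGbMonth = 0.023;    // rough S3 standard-tier rate
        double storjPerGbMonth = 0.015; // Storj's current quoted rate

        System.out.printf("S3:    $%.0f/year%n", gigabytes * s3PerGbMonth * 12);    // ~$138
        System.out.printf("Storj: $%.0f/year%n", gigabytes * storjPerGbMonth * 12); // ~$90
    }
}
```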

To be clear, Storj is a tool for developers. After signing up you’re presented with a GUI for creating buckets, but when it’s time to start moving data into those buckets you’ll need the API. Right now there is a NodeJS library, or you can use the command-line tool that ships with the native installer.

This service definitely looks promising, but it is important to know that it is still early. One thing to think about is what happens if farmers start dropping out of the network. When your file is split into shards, each shard is copied to multiple nodes. In my quick test, shards were spread across five nodes, which is plenty to give me confidence that I will be able to get my file back.

If a node drops offline, it is supposed to trigger a replication of your shard to another farmer using one of the remaining good nodes. This works great unless all of the farmers who hold one of your shards drop at once, but with 19,000 farmers and climbing, and assuming your shards are always on multiple nodes, the chances of that happening seem very, very low. The docs say that Storj is working on rolling out additional mirroring strategies. And, you can always use the API to ask Storj which nodes your file is sharded across. It looks like you can make an API call to move a shard yourself, but I haven’t tried that yet.

One last thing to point out is that this is an open source project. You are welcome to contribute. You can even grab the software and run a completely private Storj network, if you want.

I feel like some of my clients are still getting used to the idea of putting their data in the cloud. And some like one throat to choke. A distributed cloud like this may be a tougher sell for conservative customers, even if the security and the durability are there. Still, I love the concept. What do you think?

When to consider Cloud CMS for your content management project

Cloud CMS announced today that it has added support for CMIS. This is a nice addition for all sorts of reasons, but near the top, from Cloud CMS’s perspective, is that it makes it easier to migrate content from existing solutions into the Cloud CMS repository.
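
One way to see why CMIS support matters for migrations: with a standard interface you can read content out of an existing repository using generic tooling such as Apache Chemistry OpenCMIS and then push it into Cloud CMS. Here is a hedged sketch of the read side; the endpoint URL and credentials are placeholders, and a real migration would also recurse and copy each document’s content stream and metadata into the target.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.chemistry.opencmis.client.api.CmisObject;
import org.apache.chemistry.opencmis.client.api.Folder;
import org.apache.chemistry.opencmis.client.api.Repository;
import org.apache.chemistry.opencmis.client.api.Session;
import org.apache.chemistry.opencmis.client.api.SessionFactory;
import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
import org.apache.chemistry.opencmis.commons.SessionParameter;
import org.apache.chemistry.opencmis.commons.enums.BindingType;

// Hedged sketch: list what lives in a CMIS-compliant source repository using
// Apache Chemistry OpenCMIS. The URL and credentials below are placeholders.
public class CmisSourceWalker {
    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put(SessionParameter.ATOMPUB_URL, "https://source.example.com/cmis/atom");
        params.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value());
        params.put(SessionParameter.USER, "admin");
        params.put(SessionParameter.PASSWORD, "secret");

        SessionFactory factory = SessionFactoryImpl.newInstance();
        Repository repository = factory.getRepositories(params).get(0);
        Session session = repository.createSession();

        // Walk the root folder; a real migration would recurse and copy each
        // document's content and metadata into the target repository.
        Folder root = session.getRootFolder();
        for (CmisObject child : root.getChildren()) {
            System.out.println(child.getType().getId() + " : " + child.getName());
        }
    }
}
```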

Back in November I did a series of reviews on content-as-a-service providers. One of my posts was on Cloud CMS. The post assumes you are looking for hosted content-as-a-service and shows how Cloud CMS compares to other cloud offerings.

What I think we’re going to start seeing more and more, however, are people who might consider Cloud CMS as an alternative to traditional on-premises ECM vendors like Alfresco, Nuxeo, Documentum, and Microsoft. Although Cloud CMS was originally built to be a hosted, content-centric back-end for mobile and web applications, it can just as easily function as your hosted intranet or document management repository.

With custom content models, event triggers, and custom workflows, you may find that the only difference between Cloud CMS and your current on-premises document repository is that you don’t have to worry about software or hardware installation and upgrades any longer.

Considering Cloud CMS as an alternative to traditional players may make sense when:

  1. A 100% cloud-native solution is preferred. While Cloud CMS could be run on-premises, it is certainly built to be hosted on your behalf. Plus, one of the benefits of letting Cloud CMS run, upgrade, and scale your repository is that you don’t have to.
  2. Customization is important. Some of the traditional vendors have made cloud add-ons for their products, but they then lock down the content model and the user interface so that it cannot be customized. Cloud CMS offers the benefits of hassle-free operations while maintaining your ability to customize it to meet your exact requirements.
  3. Budget is constrained. Clients who need enterprise-grade features and the peace of mind that a support contract brings, but who can’t justify the high cost of a traditional vendor’s enterprise license, may find Cloud CMS to be a lower-cost alternative. Rather than licensing by the seat or the server, Cloud CMS pricing is based on how, and how much, you use the system.

Clients who have very straightforward needs (simple file sync and share, for example) will probably choose something a little more utilitarian, like Google Drive, Box, Dropbox, or Amazon Zocalo. And, despite Cloud CMS recently having undergone an extensive security audit, I know some clients may still be reluctant to move to the cloud. Everyone else, though, should take a hard look at Cloud CMS.

 

Alfresco, NOSQL, and the Future of ECM

Alfresco wants to be a best-in-class repository for you to build your content-centric applications on top of. Interest in NOSQL repositories seems to be growing, with many large, well-known sites choosing non-relational back-ends. Is Alfresco (and, more generally, nearly every ECM and WCM vendor) on a collision course with NOSQL?

First, let’s look at what Alfresco’s been up to lately. Over the last year or so, Alfresco has been shifting to a “we’re for developers” strategy in several ways:

  • Repositioning their Web Content Management offering not as a non-technical end-user tool, but as a tool for web application developers
  • Backing off of their mission to squash Microsoft SharePoint, positioning Alfresco Share instead as “good enough” collaboration. (Remember John Newton’s slide showing Microsoft as the Death Star and Alfresco as the Millennium Falcon? I think Han Solo has decided to take the fight elsewhere.)
  • Making Web Scripts, Surf, and Web Studio part of the Spring Framework.
  • Investing heavily in the Content Management Interoperability Services (CMIS) standard. The investment is far-reaching–Alfresco is an active participant in the OASIS specification itself, has historically been first-to-market with their CMIS implementation, and has multiple participants in CMIS-related open source projects such as Apache Chemistry.

They’ve also been making changes to the core product to make it more scalable (“Internet-scalable” is the stated goal). At a high level, they are disaggregating major Alfresco sub-systems so they can be scaled independently and in some cases removing bottlenecks present in the core infrastructure. Here are a few examples. Some of these are in progress and others are still on the roadmap:

  • Migrating away from Hibernate, which Alfresco Engineers say is currently a limiting factor
  • Switching from “Lucene for everything” to “Lucene for full-text and SQL for metadata search”
  • Making Lucene a separate search server process (presumably clusterable)
  • Making OpenOffice, which is used for document transformations, clusterable
  • Hiring Tom Baeyens (JBoss jBPM founder) and starting the Activiti BPMN project (one of their goals is “cloud scalability from the ground, up”)

So for Alfresco it is all about being an internet-scalable repository that is standards-compliant and has a rich toolset that makes it easy for you to use Alfresco as the back-end of your content-centric applications. Hold that thought while we turn our attention to NOSQL for a moment. Then, like a great rug, I’ll tie the whole room together.

NOSQL Stores

A NOSQL (“Not Only SQL”) store is a repository that does not use a relational database for persistence. There are many different flavors (document-oriented, key-value, tabular), and a number of different implementations. I’ll refer mostly to MongoDB and CouchDB in this post, which are two examples of document-oriented stores. In general, NOSQL stores are:

  • Schema-less. Need to add an “author” field to your “article”? Just add it–it’s as easy as setting a property value (see the sketch after this list). The repository doesn’t care that the other articles in your repository don’t have an author field. The repository doesn’t know what an “article” is, for that matter.
  • Eventually consistent instead of guaranteed consistent. At some point, all replicas in a given cluster will be fully up-to-date. If a replica can’t get up-to-date, it will remove itself from the cluster.
  • Easily replicate-able. It’s very easy to instantiate new server nodes and replicate data between them and, in some cases, to horizontally partition the same database across multiple physical nodes (“sharding”).
  • Extremely scalable. These repositories are built for horizontal scaling so you can add as many nodes as you need. See the previous two points.
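
As a concrete illustration of the schema-less point in the first bullet, here is a minimal sketch using MongoDB’s Java driver (the database, collection, and field names are made up): two documents with different shapes live happily in the same collection, with no schema change required.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Minimal schema-less example. Nothing here declares an "article" type or an
// "author" column: the second document simply carries an extra field, and the
// collection stores both shapes side by side.
public class SchemalessExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> articles =
                    client.getDatabase("cms").getCollection("articles");

            // An older article with no author field...
            articles.insertOne(new Document("title", "NOSQL and ECM"));

            // ...and a newer one that adds "author" on the fly.
            articles.insertOne(new Document("title", "Alfresco and Mongo")
                    .append("author", "jpotts"));
        }
    }
}
```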

NOSQL repositories are used in some extremely large implementations (Digg, Facebook, Twitter, Reddit, Shutterfly, Etsy, Foursquare, etc.) for a variety of purposes. But it’s important to note that you don’t have to be a Facebook or a Twitter to realize benefits from this type of back-end. And, although the examples I’ve listed are all consumer-facing, huge-volume web sites, traditional companies are already using these technologies in-house. I should also note that for some of these projects, scaling down is just as important as scaling up–the CouchDB founders talk about running Couch repositories in browsers, cell phones, or other devices.

If you don’t believe this has application inside the firewall, go back in time to the explosive growth of Lotus Notes and Lotus Domino. The Lotus Notes NSF store has similar characteristics to document-centric NOSQL repositories. In fact, Damien Katz, the founder of CouchDB, used to work for Iris Associates, the creators of Lotus Notes. One of the reasons Notes took off was that business users could create form-based applications without involving IT or DBAs. Notes servers could also replicate with each other which made data highly-available, even on networks with high latency and/or low bandwidth between server nodes.

Alfresco & NOSQL

Unlike a full ECM platform like Alfresco, NOSQL repositories are just that–repositories. Like a relational database, there are client tools, APIs, and drivers to manage the data in a NOSQL repository and perform administrative tasks, but it’s up to you to build the business application around it. Setting up a standalone NOSQL repository for a business user and telling them to start managing their content would be like sticking them in front of MySQL and doing the same. But business apps with NOSQL back-ends are being built. For ECM, projects are already underway that integrate existing platforms with these repositories (see the DrupalCon presentation, “MongoDB – Humongous Drupal”, for one example), and entirely new CMS apps have been built specifically to take advantage of NOSQL repositories.

What about Alfresco? People are using Alfresco and NOSQL repositories together already. Peter Monks, together with others, has created a couple of open source projects that extend Alfresco WCM’s deployment mechanism to use CouchDB and MongoDB as endpoints (here and here).

I recently finished up a project for a Metaversant client in which we used Alfresco DM to create, tag, secure, and route content for approval. Once approved, some custom Java actions deploy metadata to MongoDB and files to buckets on Amazon S3. The front-end presentation tier then queries MongoDB for content chunks and metadata and serves up files directly from Amazon S3 or Amazon’s CloudFront CDN as necessary.
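
Here is a hedged sketch of that deployment pattern, not the actual project code: a custom Alfresco action that pushes a node’s metadata into a MongoDB collection and its file content into an S3 bucket once the item is approved. The bucket, database, and collection names are illustrative, and the Alfresco services would be wired in via Spring as usual.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.alfresco.model.ContentModel;
import org.alfresco.repo.action.executer.ActionExecuterAbstractBase;
import org.alfresco.service.cmr.action.Action;
import org.alfresco.service.cmr.action.ParameterDefinition;
import org.alfresco.service.cmr.repository.ContentReader;
import org.alfresco.service.cmr.repository.ContentService;
import org.alfresco.service.cmr.repository.NodeRef;
import org.alfresco.service.cmr.repository.NodeService;
import org.bson.Document;

import java.util.List;

// Hedged sketch of the pattern described above: once content is approved, push
// the node's metadata to MongoDB (for the presentation tier to query) and the
// binary to S3 (to be served directly or via CloudFront).
public class DeployToMongoAndS3Action extends ActionExecuterAbstractBase {

    private NodeService nodeService;       // injected via Spring
    private ContentService contentService; // injected via Spring
    private AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private MongoCollection<Document> chunks =
            MongoClients.create("mongodb://localhost:27017")
                    .getDatabase("delivery").getCollection("chunks");

    @Override
    protected void executeImpl(Action action, NodeRef nodeRef) {
        String name = (String) nodeService.getProperty(nodeRef, ContentModel.PROP_NAME);

        // 1. Metadata goes to MongoDB.
        chunks.insertOne(new Document("nodeRef", nodeRef.toString())
                .append("name", name)
                .append("title", String.valueOf(
                        nodeService.getProperty(nodeRef, ContentModel.PROP_TITLE))));

        // 2. The file content goes to S3.
        ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentType(reader.getMimetype());
        meta.setContentLength(reader.getSize());
        s3.putObject("my-delivery-bucket", name, reader.getContentInputStream(), meta);
    }

    @Override
    protected void addParameterDefinitions(List<ParameterDefinition> paramList) {
        // no parameters for this sketch
    }

    public void setNodeService(NodeService nodeService) { this.nodeService = nodeService; }
    public void setContentService(ContentService contentService) { this.contentService = contentService; }
}
```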

In these examples, Alfresco is essentially being used as a front-end to the NOSQL repository. This gives you the scalability and replication features on the Content Delivery tier with workflow, check-in/check-out, an explicit content model, tagging, versioning, and other typical content management features on the Content Management tier.

But why shouldn’t the Content Management tier benefit from the scalability and replication capabilities of a NOSQL repository? And why can’t a NOSQL repository have an end-user-focused user interface with integrated workflow, a form service, and other traditional DM/CMS/WCM functionality? It should, it can, and they will. NOSQL-native CMS apps will be developed (some already exist). And existing CMSs will evolve to take advantage of NOSQL back-ends in some form or fashion, similar to the Drupal-on-Mongo example cited earlier.

What does this mean for Alfresco and ECM architecture in general?

Where does that leave Alfresco? It seems their positioning as a developer-focused, “Internet-scale” repository ultimately leads to them competing directly against NOSQL repositories for certain types of applications. The challenge for Alfresco and other ECM players is whether they can achieve the kind of scale and replication capabilities NOSQL repositories offer today before NOSQL catches up on the content management side, with a new breed of solutions built expressly for a world in which content is everywhere, user and data volumes are huge and unpredictable, and servers come and go automatically as needed to keep up with demand.

If Alfresco and the overwhelming majority of the rest of today’s CMS vendors are able to meet that challenge with their current relational-backed stores, NOSQL simply becomes an implementation choice for CMS vendors. If, however, it turns out that being backed by a NOSQL repository is a requirement for a modern, Internet-scale CMS, we may see a whole new line-up of players in the CMS space before long.

What do you think? Does the fundamental architecture prevalent in today’s CMS offerings have what it takes to manage the web content in an increasingly cloud-based world? Will we see an explosion of NOSQL-native CMS applications and, if so, will those displace today’s relational vendors or will the two live side-by-side, potentially with buyers not even knowing or caring what choice the vendor has made with regard to how the underlying data is persisted?

ECM vendors have their heads in the cloud, can you see through the fog?

The hype around cloud computing has reached a fever pitch, so it is natural that ECM vendors try to take advantage of that as much as they can. Some examples from the open source ECM world:

  • Alfresco always seems to be partnering with one cloud vendor or another. I went to a brief session on Alfresco, GoGrid, and ParaScale earlier this year. (As an aside, those GoGrid cycling socks, which I thought were a strange giveaway at the time, are awesome.)
  • At the end of last year eZ Publish announced a partnership with Mamut to provide eZ as SaaS.
  • Just last week Nuxeo announced a cloud edition of its product.

Clearly, ECM vendors are busy figuring out how to take advantage of the cloud. But what does it mean for ECM to be “in the cloud”? When might it work for you?

Cirrus, Stratus, or Cumulonimbus

The first thing you need to realize is that when people say “cloud” they often mean very different things. Generally, there are three types of clouds: Software-as-a-Service (SaaS), Platform-as-a-Service (PaaS), and Infrastructure-as-a-Service (IaaS).

Software-as-a-Service (SaaS) is the same model that’s been around for years but has lately taken advantage of the cloud moniker. Google Apps and Salesforce.com are the big SaaS players but there are SaaS offerings for all kinds of business applications, including content management.

The allure of SaaS ECM is the same as that of SaaS in general:

  • Lower up-front costs
  • Someone else gets to worry about running and scaling the infrastructure
  • Depending on the vendor, you may only have to pay for what you use

The challenges of SaaS ECM include things like:

  • The ability to do heavy customization and complex workflows
  • Ease of integration with other systems
  • Client perceptions (and real issues) around data security
  • Data portability/vendor lock-in

Open source CM vendors Nuxeo and eZ Systems have SaaS offerings, as do proprietary vendors such as SpringCM, CrownPeak, Clickability, and PaperThin, to name a few. Beyond general-purpose document and content management, I think you’ll also see vendors build verticalized SaaS offerings on top of hosted content management technology.

The next type of cloud is Platform-as-a-Service (PaaS). The two best examples of PaaS are Google App Engine (GAE) and Salesforce.com’s force.com platform. With PaaS, you provide the code and the PaaS provider does the rest. Of course this means your code has to follow certain standards and is often subject to limitations, but the beauty is that you get a completely custom solution without worrying about any of the infrastructure.

I like GAE. For certain applications, the benefits of instantaneous, global scale far outweigh the limitations of the platform. But I don’t expect ECM vendors that would do well in SaaS or IaaS clouds to do much with PaaS. You can’t take an Alfresco or a Drupal and run it on a PaaS cloud. I do think we will see PaaS-native content management systems. For example, I’ve seen apps in the Salesforce.com AppExchange that are basically tools for building a web site that’s tightly integrated with Salesforce.com. I think you’ll also see solutions that leverage a PaaS for certain components or sub-systems.

The third type of cloud is Infrastructure-as-a-Service (IaaS). An IaaS cloud is about providing virtual servers on-demand. Examples include things like Amazon’s EC2, Rackspace Cloud, and GoGrid. With these services you can instantly provision as many servers as you need. What you do with them is up to you. When you’re done, you turn them off. Specifics vary but you are essentially billed for CPU time.

The way people leverage IaaS differs. Some people will provision a server and install their ECM software of choice and stop there. Other than dealing with different file storage approaches of various IaaS vendors, this is really no different than running your own virtual servers. So when someone says they are running XYZ CMS “in the cloud” and it turns out to be a single node on a virtual machine, I can barely stifle a yawn. It’s fast and convenient to set up, yes, but technically it’s pretty boring.

The more interesting way to use ECM in an IaaS cloud is to leverage the ability of the infrastructure to scale on-demand. That’s the real value of “the cloud” after all. For example, at Optaros we run an IaaS-hosted solution called OView that syndicates content and content-centric applications to web sites. When a client places that content or app on Yahoo’s home page we get a huge spike in traffic. We run the solution on Amazon EC2 images and we use RightScale to dynamically provision additional nodes when traffic warrants.

The degree to which a specific ECM vendor can operate in a dynamically-scaled infrastructure varies greatly. Simply “running in the cloud” is easy. Scaling your ECM infrastructure automagically is harder.

What do you really need?

If the list of SaaS benefits has a lot of appeal to you and the challenges and potential limitations aren’t much of a bother, SaaS ECM might be worth evaluating. This will most likely be a better fit for clients with limited IT resources and simple-to-moderate ECM requirements.

On the IaaS front, if it is just an issue of externally-hosting your ECM infrastructure, make sure the cloud is what you want. The best use case for the cloud is when demand is temporary or unpredictable with huge spikes. I would argue that for your core ECM infrastructure demand is neither temporary nor unpredictable.

If “scale” is your issue, I would challenge you to think about exactly what needs to be scaled. If it is just content delivery of static content, maybe you could get by with a CDN. If your content management system can separate authoring from dynamic delivery of content, maybe only the dynamic content delivery mechanism needs to be able to scale quickly.

You might have certain processes (large-scale video transcoding, for example, or other types of periodic batch processing) that you could leverage the cloud for without cloud-enabling your entire ECM infrastructure. Acquia’s hosted spam filtering service, Mollom, and their newly released hosted-search offering are two examples where only specific pieces of your infrastructure are off-loaded to the cloud.

If it turns out that you need to scale the whole ball of wax, fine, it can be done, but have a good reason.

ECM in the cloud is, um, cloudy

The cloud as a style of computing is exciting. The cloud as a “feature” is potentially confusing. ECM vendors are going to do what they can to have it somewhere “on the box”. But it’s not something you can simply check off. The next time you hear an ECM vendor say “cloud-ready”, ask them what they mean. Then figure out whether or not that has any relevance at all to your real requirements.

Is the cloud on your horizon? Let me know if/how the cloud relates to your ECM strategy.

Google App Engine Now Supports Java

I’ve been playing with the newly-released Java support in Google App Engine and it is pretty cool. You can do more than I expected you could:

  • The Google App Engine Eclipse plug-in gives you a template project and associated config files, Ant build scripts, a deployment tool, and a local run-time environment that acts like GAE (user service, data store, limitations imposed by the platform).
  • You’ve got full persistence and query capability via JDO. You pretty much just model your entities as POJOs, annotate the fields in those classes as “persistent”, and you’re good to go. You use JDOQL to query your objects (see the sketch after this list). Queries will only return the first 1,000 results.
  • You can run cron jobs. A cron job wakes up on a schedule and invokes a URL you specify.
  • Servlets and JSPs are supported but you can also use things like Struts and Spring (See Will it Work in Google App Engine?).
  • You can take advantage of Google’s User service, which means anyone with a Google account can sign in to your app without creating a new account.
  • You can take advantage of Memcache if you need it (JCache).
  • You can fetch URLs via the URL Fetch service or java.net.URLConnection.
  • You can send mail via JavaMail.
  • You can use their Image service to resize, rotate, flip, and crop images.
  • Both JDK 5 and JDK 6 are supported.
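
To make the JDO bullet concrete, here is a minimal sketch assuming the default GAE project template (the “transactions-optional” persistence unit name comes from that template; the Greeting entity and its fields are just an example):

```java
import java.util.List;
import javax.jdo.JDOHelper;
import javax.jdo.PersistenceManager;
import javax.jdo.PersistenceManagerFactory;
import javax.jdo.Query;
import javax.jdo.annotations.IdGeneratorStrategy;
import javax.jdo.annotations.PersistenceCapable;
import javax.jdo.annotations.Persistent;
import javax.jdo.annotations.PrimaryKey;

// A plain POJO with annotated persistent fields, stored and queried via JDOQL.
@PersistenceCapable
class Greeting {
    @PrimaryKey
    @Persistent(valueStrategy = IdGeneratorStrategy.IDENTITY)
    private Long id;

    @Persistent
    private String author;

    @Persistent
    private String content;

    Greeting(String author, String content) {
        this.author = author;
        this.content = content;
    }
}

public class GreetingDao {
    private static final PersistenceManagerFactory PMF =
            JDOHelper.getPersistenceManagerFactory("transactions-optional");

    @SuppressWarnings("unchecked")
    public static void saveAndList() {
        PersistenceManager pm = PMF.getPersistenceManager();
        try {
            pm.makePersistent(new Greeting("jeff", "Hello, App Engine"));

            // JDOQL query; remember that any query returns at most 1,000 results.
            Query query = pm.newQuery(Greeting.class, "author == authorParam");
            query.declareParameters("String authorParam");
            List<Greeting> results = (List<Greeting>) query.execute("jeff");
            System.out.println(results.size() + " greeting(s) found");
        } finally {
            pm.close();
        }
    }
}
```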

There are some limits:

  • Execution of requests is limited to 30 seconds and that includes URLs invoked by cron jobs.
  • You can’t write to the file system. If you need to write out files, I assume you’d use S3 or something.
  • You can’t open sockets.
  • Each developer can create up to 10 applications, and apps can’t be deleted, so don’t fill up on Hello Worlds.
  • You can run an app that has up to 500 MB of storage and serves 5 million page views per month at no cost.

The beauty, obviously, is that as a developer, you get to focus on the code and let Google worry about scaling. For many applications, this Platform-as-a-Service (PaaS) will be preferred over Infrastructure-as-a-Service (IaaS). In an IaaS setup, you can use solutions like RightScale to automatically provision new nodes to handle spikes in demand, but you still have to set that up. Plus, you’ve got the additional cost and headache of installing, configuring, and maintaining the application server and database software (and making sure it is set up to work when new nodes are auto-provisioned). With App Engine, scaling globally is pretty simple: Step 1 – Write (Good) Code; Step 2 – Deploy Code to GAE.