Clean up your Elasticsearch query logic with search templates

Photo credit: Marcin Wichary

Photo credit: Marcin Wichary

I stumbled onto Elasticsearch’s search templates feature on my last project and it turned out to be really useful. I remember being surprised I hadn’t seen it mentioned anywhere. I’ve asked around at the last couple of meetups I’ve attended and it turns out many people don’t know about search templates, so I thought it might make a good post.

I’m going to give some context as to why this feature was useful, then I’ll show you how to use it. If you don’t want or need the context, feel free to skip to the next section.

Context: Real world use case for search templates

For this particular project we were using Elasticsearch as a content service. A set of front-end Single Page Applications (SPAs) query the content service. The content service returns content objects as JSON that match some criteria. Components in the SPAs format the objects as needed.

The content service is a Java-based API that sits between the SPAs and Elasticsearch. The API abstracts the Elasticsearch details and adds some business logic regarding which content to return beyond simply matching the parameters specified by the front-end.

A simple example of one type of business logic the API adds is publish date and expiration date handling. All of our JSON objects in Elasticsearch have a publish date and some have an expiration date. When the front-end asks the API for content, we only want to return content that is current–in other words the current date has to be greater than or equal to the publish date and less than the expiration date if an expiration date is set.

If you leave this up to the front-end, and if the API is open, then anyone can get any content object, regardless of effectivity, which, in our case, is a bad thing. Even if the API was locked down, there’s no reason to make each front-end application duplicate the date handling logic. So the API handles that and other business rules around fetching content and constructing a response.

The native Elasticsearch API is Java, so building and executing queries in Java is a very natural thing to do. However, as the service evolved, the part of our code responsible for constructing the query was at risk of becoming unwieldy. We also started to identify new types of queries the front-end needed to execute that didn’t fit cleanly into our existing query-building logic.

In addition to identifying new types of queries the API needed to support, we began to see that the front-end applications would need to be able to provide more than just a flat list of key-value pairs–at the very least they would need to ask for content with parameters that included arrays and dictionaries as well.

The service had reached a point where it needed flexibility in the number and type of queries it could run and the parameters it could accept, but we didn’t want to expose the full power (and complexity) of the Elasticsearch Query DSL. Search templates to the rescue.

What is an Elasticsearch Search Template?

(This section contains embedded gists. If you can’t see them you may need to enable JavaScript. If all else fails, the gists live here.)

An Elasticsearch search template is kind of like a stored procedure in a relational database. Really, it’s just a normal query with replacement variables, aka template parameters. The templates are expressed using Mustache.

Here’s a simple example:

That example specified the template and the parameters in the same request. Obviously, if you’re going to do that you might as well not use a template.

What you’d rather do is put the template somewhere and then invoke it. You have two options. You can index the template or you can put the template on the file system.

Here’s how you index a template:

And then you can call it, like this:

If you’d rather put the template on the file system, it goes in $ES_HOME/config/scripts and is named template-id.mustache. Once you’ve deployed the template to every node in your cluster, you can call it, like this:

You don’t have to restart the node when you update a search template. Elasticsearch picks up the changes automatically. If you watch the log when you update a search template you should see something like:

[2016-01-14 18:02:39,797][INFO ][script] [node01] compiling script file [/opt/elasticsearch/jtpcluster01/config/scripts/tweets.mustache]

Including conditionals in your search template

Suppose we want to return all tweets unless a “since” parameter is provided. If since is specified, the query should do a date range against the timestamp property using the value provided. Mustache has some support for conditionals. Here’s how it looks in a search template:

This template will conditionally add the date range check only if the “since” parameter is provided.

Note: Be careful of spacing here. I like to put a space between my curly braces and the parameter. But if you do that in the conditional, mustache won’t recognize it.

To get the tweets for the last 30 days, you’d call the search template like this:

And to get all of the tweets you’d just omit the since parameter.

As your templates get more complex you might take a look at this tool. It allows you to quickly see how your templated queries will render given a set of parameters.

Using negation to implement if-then-else logic

Suppose that instead of returning all tweets we want to return just the last day of tweets unless the since parameter is specified. You’d like to use an if-then-else in the Mustache template. Else isn’t specifically supported by Mustache, but we can use negation to achieve the same thing.

This template keeps the clause that does the date check if since is specified, but now adds a default date check if it is not:

If the date is specified in the since parameter, it works as it did before. If not, only the last day of tweets will be returned.

Working with Arrays

Something that is kind of annoying is how to handle arrays. You can iterate over an array with Mustache fairly easily. But Mustache doesn’t have a mechanism for checking a position in an array such as “isLast” or “hasNext”, so if you need to do something like that, you’ll end up making your own construct.

For example, suppose we want to be able to pass in a list of user names to the search template to restrict the list of tweets to those specific users. The easy way to handle that in our query is to use a terms filter, like this:

{
  "terms": {
    "user": ["jeffpotts01", "elastic"]
  }
}

But that doesn’t let me show how to work with arrays so I’m going to contrive the example to say that if a user list is provided, we need to add an “or” clause to the query with one term filter per user name.

To do that, we’ll require the list of users to be provided as a search template like this:

The template can check for the “userList” parameter to know whether or not to build the “or” clause. Then it can iterate over the “users” array, plucking out the name.

As the template iterates over the array, it needs to know whether or not it is on the last user. Otherwise it has no way of knowing whether or not to add the comma separating the term filters. Mustache can’t help us so the search parameter will include “isLast” set to true for the last user in the list.

Here’s a template that can handle the array of user names:

The result of calling the template above with the example user list is a query that looks like this:

With those simple constructs you ought to be able to create some very elaborate search templates.

Invoking search templates from the Java API

Back in the service layer, it is easy to invoke a search template with the Java API. Here’s how that looks:

In the real API, those params are getting POSTed to the endpoint.

Query changes without a code deployment

With search templates, we can add new queries and modify existing queries by creating and modifying search templates. This means for many adjustments, we don’t have to build and deploy the custom content service API code. And troubleshooting is easier too because we can invoke the same search template the service is using directly and not worry about whether or not the Java API is building the query we expect.

So the next time you find yourself writing code to construct an Elasticsearch query, ask yourself if it would make more sense to externalize it as a search template.

Posted in Elasticsearch, Search | Tagged , , | 7 Comments

Using Hubot and Watcher to automate Elasticsearch admin tasks via chat

hubot-avatar@2xAlmost all of my client work is remote. For many projects, that means chat is an essential communication tool. When you and your team essentially live in a chat window it’s nice when your tools can participate in the conversation. Luckily, it’s pretty easy to wire this up. Let me show you how I did it for a recent Elasticsearch project.

Openfire: An open source chat server

Today, hosted chat services like Slack and HipChat get all of the attention. The approach I outline in this blog post will work with those tools too, but on this particular project we’re running an open source chat server on-prem called Openfire. Openfire has been around for a long time. I like it because it is open source, easy to install, and will run anywhere you can run Java.

Because it implements an open protocol called XMPP (aka Jabber) there are a variety of chat clients that will work with it. Openfire ships a web-based client called Spark and some of my teammates use that, but most of the time I use Adium on my Mac.

If you need help installing Openfire, take a look at the docs.

Inbound and outbound integration with Elasticsearch

Once your chat solution is working, it’s time to integrate it with Elasticsearch. For my requirements I needed two “directions” for this integration. First, I wanted to be able interrogate one or more of my Elasticsearch clusters from within chat. This “outbound” integration requires a “bot”. There are many open source bots to choose from and examples of bot scripts working with Elasticsearch. I’ll cover both shortly.

The other direction I needed was “inbound”–I wanted my Elasticsearch cluster to be able to tell the chat server when something is wrong with the cluster. This requires something to monitor the health of the cluster (we use Watcher, a paid add-on from Elastic) and a web hook that can use the chat server API to send messages.

Let me cover the outbound implementation–the bot–first. Then I’ll talk about Watcher and the web hook which make up the inbound implementation.

Hubot: An open source chat bot from Github

There are a number of chat bots out there. I went with Hubot from Github. Hubot is based on Node.js. Hubot scripts are written in Coffeescript. However, if you are new to Node or Coffeescript there are plenty of examples out there so don’t let that stop you from using Hubot.

I used this blog post to get Hubot working. However, there were a few gotchas I should point out:

  • I had to use an old version of node.js (0.10.23). The newer version was having a lot of trouble with one of its dependencies and I got tired of fooling with it.
  • The blog post lists some Linux dependencies you need to install, but it leaves one out that’s critical: libicu. On Centos this is libicu-devel and on Ubuntu it is libicu-dev.
  • The blog post specifies some environment variables that need to be set. If you are running Hubot with Openfire, the HUBOT_XMPP_ROOMS variable needs to be set to the fully-qualified conference room name. For example, if the Hubot username is “hubot” running on a host named “grumpy” the variable should be set to hubot@conference.grumpy.
  • You may have to set HUBOT_XMPP_HOST to the hostname of your Openfire server.

Other than that, you should be able to use that blog post to get Hubot and Openfire working.

Hubot and Elasticsearch

There are Hubot scripts that do all sorts of stuff. One of the fun things about adding a bot to your chatroom is to have it do something silly. Maybe every time someone uses the word “Dude” the bot throws out a quote from the Big Lebowski, for example. So you’ll see lots of stuff like that. But there are also more useful examples out there. Here is the one I started with. The hubot-elasticsearch script knows how to use the Elasticsearch API to spit out information about nodes, indices, allocation, and settings. And it allows you to alias your clusters so you don’t have to constantly tell the bot what your URL endpoints are.

Out-of-the-box, the hubot-elasticsearch project is not compatible with Shield, but it’s a decent start. I made a small tweak to get it to work with Watcher, which I’ll cover next.

Watcher: Monitoring and Alerting for Elasticsearch

This particular client is a paying customer of Elastic, which means they are entitled to paid-only add-ons such as Shield (secures the cluster) and Watcher (for monitoring and alerting).

Watcher is pretty handy and we’re glad to have it, but if you aren’t able to use it for some reason, writing your own tool for running tasks on a schedule isn’t too tough. I wrote something similar using Spring MVC and Quartz, for example. You just need something that will periodically interrogate the cluster and then take some action based on some condition. But if you are an Elastic customer there’s no need to build it. The rest of the post assumes that’s the case.

I’ll let you read the Watcher docs to learn more, but at a high level, a watch consists of a trigger, an input, a condition, and an action. The trigger is the schedule. The input might be an Elasticsearch query or the response from some random HTTP endpoint. The condition looks at the input and then decides whether or not action is needed. The action taken might be to send an email, create some data in Elasticsearch, or invoke a web hook.

For my needs, the web hook action is perfect–if one of my watch conditions is met, like maybe something goes wrong with my cluster and the cluster state goes to red, Watcher will invoke my web hook which will post a message in the chat room. Here’s what the action part of my watch definition looks like:

"actions": {
    "notify_chat": {
        "webhook": {
            "method": "POST",
            "host": "localhost",
            "port": 8008,
            "path": "/chat",
            "headers": {
                "Content-Type": "application/json"
            },
            "body": "cluster_health alert: Someone needs to look at the DEV cluster. It appears to be in a RED state."
        }
    }
}

Watcher can have any number of actions listed for a given watch. In this case, I’m using a single “webhook” action called “notify_chat” that does a POST to a URL running on port 8008. That URL could be anything, and it can include basic authentication.

Web Hook: Spring Boot, Spring MVC, and Smack

I’ve been using Spring Boot lately when I need to knock out a quick RESTful API. In this case, I just needed something to listen to the “/chat” end point. When it is called, the code grabs the message posted to it and uses the Smack API to connect to the chat room and post the message. This webapp is probably less than 10 lines of code and Spring Boot packages it up nicely for me.

If you need help with this part take a look at the Smack API Multi User Chat docs.

Tweaking the bot to allow watch acknowledgement

Watcher can throttle or suppress actions based on a time period (“Don’t tell me about this condition again for 5 minutes,” for example) or explicit acknowledgement. If a watch is triggered that uses explicit acknowledgement, I want to be able to acknowledge that from within chat. You already saw that the Elasticsearch Hubot script can talk to the cluster. It’s pretty easy to tweak the script to allow Watcher acknowledgement.

First, I added a function called “ackWatch” that actually does the work of acknowledging the watch:

ackWatch = (msg, watch_id, alias) ->
  cluster_url = _esAliases[alias]

  if cluster_url == "" || cluster_url == undefined
    msg.send("Do not recognize the cluster alias: #{alias}")
  else
    msg.send("Acknowledging watch: #{watch_id}")

  msg.http("#{cluster_url}/_watcher/watch/#{watch_id}/_ack")
    .put() (err, res, body) ->
      msg.send("Acknowledged")

Then, I added the regular expression that the bot should be listening for:

robot.hear /elasticsearch ack (.*) (.*)/i, (msg) ->
  if msg.message.user.id is robot.name
    return

  ackWatch msg, msg.match[1], msg.match[2], (text) ->
    msg.send text

With that in place, any user in the chat room can acknowledge a watch by typing, “hubot: elasticsearch ack some_watch some_alias” where some_watch is the ID of a watch and some_alias is the nickname for the cluster you’re talking about (like “dev”, “qa”, or “prod”, for example).

Putting it all together: A short demo

With all of this in place, my Elasticsearch clusters can tell the team when something interesting is going on and the team can acknowledge that alert and do preliminary investigation by interrogating the cluster, all from the comfort of their chat window.

The video below shows this working. In it, I create a simple watch that invokes a web hook to post a message to the chat room when a watch condition is met.

The demo uses a simple example where the alert is triggered when the test index is a certain size. But you could easily wire up any watch to the same action, such as when your cluster state goes red or when CPU or RAM reach a certain threshold.

This was relatively simple to put together, but hopefully you can see how you could build on this to automate all kinds of things related to monitoring, alerting, and administration of your Elasticsearch cluster from chat.

Posted in Collaboration, Elasticsearch, Open Source, Search | Tagged , , , , , | Leave a comment

Would the commercial open source software you depend on survive a zombie apocalypse?

2596483147_58d6bae3b1_mPick a commercial open source software product that’s important to you. Now imagine that there’s been a global zombie attack. Fortunately, after much heroics and stylized cinematic violence, the attack is eventually thwarted. Unfortunately, it wasn’t soon enough. Everyone who receives a paycheck from the company behind your favorite commercial open source company is now a zombie in the steadfast pursuit of acquiring brains for food–they could care less about your sev one support ticket. The question is this: Can the open source software project survive without its commercial backer?

My blog readership consists of multiple audiences. For some of you, the question posed is interesting because your company depends on that software to run its business. For you, this is a “sole source supplier” problem with the primary concern being: If the software vendor goes away, is there another place I can get the same software?

Some of you are like me–you provide professional services around commercial open source software. The commercial company produces the software you deliver to your clients but may also provide other types of assistance such as marketing, business development, and training.

Still other readers might be classified as “community members” who give time and resources toward the project. Members of the first two groups are often members of this group as well. It is this group–the community–that becomes critical in answering this survivability question as we shall soon see.

Okay, so back to the scenario. Zombies attack. Zombies subdued. Sadly, all of the employees of our commercial open source software vendor are no more. Why is the question of survivability even a question? Remember, we are talking about open source software with a primary commercial backer. Your organic open source projects are safe from zombie attacks, for the most part, because there is no single company responsible for its development. For organic open source projects with a large, healthy community, the load of building and maintaining software is distributed across potentially hundreds or thousands of individuals employed by many different companies. Commercial open source, on the other hand, has a single company that pumps significant dollars into engineering, research and development, sales, and marketing. If that company goes away, there’s a clear risk to the project. Stakeholders need to understand this risk.

For the rest of this analysis, I’m going to ignore the very real investments commercial open source vendors make around marketing, business development, community management, and training. These things matter, of course, but they don’t matter at all unless you ship code. As the smoke clears from the zombie attack, the single most important thing facing the surviving stakeholders is getting out the next release.

Suppose our zombie-stricken company sold support for what was already an extremely successful open source project. There are many people who regularly make very significant contributions to the code base and documentation. There are lots of people answering questions in forums and on IRC. That healthy community means there is a viable population of non-zombies who will obviously mourn the loss of their former collaborators, but will no doubt bravely carry on.

Now, instead, imagine a commercial vendor who is really open source in license only–they lack a community entirely. There are exactly zero people outside of the company who know anything at all about where to get the code and how to build and test it, let alone the intricate details of how it works. It’s clear this project is doomed. Maybe someone would fork the code base and start a new company but that would be tough. Unlike what MariaDb did after the Oracle-MySQL acquisition, our scenario doesn’t allow a new company to bootstrap with experienced founders or engineers, at least those who were employed by the vendor at the time of the zombie attack.

From looking at these two extreme ends of the spectrum it is clear that the thing that allows the project to continue is the viability of the project as an independent open source project and that depends on the health and makeup of the community.

To answer this question–whether or not a commercial open source project could operate as an independent organic open source project–there are several aspects of the current community to assess:

Pre-commercial open source success. If the software project was a successful organic open source project before the commercial company appeared, this is a good indicator that if that company were to go away, the software project would survive.

Governance model. Is the software and associated trademarks under the control of an independent foundation? If so, that’s good–a lot of the details and processes will already be taken care of. Formal governance of this sort is designed around ensuring viability independent of a single commercial backer.

Number of external committers. How many committers are employed by the commercial company? If the answer is 100% that’s bad for survivability. In our scenario anyone working for the company is now a zombie which means there will be fewer experienced people to work on the next release.

Number of external contributions. What kind of contributions have typically come from the community? If it is low volume or mostly insignificant patches, this is another negative indicator. If the project stands a chance of survival post zombie attack it needs significant activity from independent contributors because that provides a population of people who are familiar with the technical details of the product.

Install base. A product installed in a large number of big companies would be ideal but even a few large “anchor” companies would be sufficient. The surviving community will need to pool its resources to move the project forward. If it is strategic to stakeholders with deep pockets, those stakeholders may be willing to invest in the software’s future.

Product complexity. The simpler, the better. When a product gets complex, its engineering team starts to specialize. A large community could do that too, but if the product is complex and the surviving community is too small, there may not be enough people who can go deep enough on every part of the system to maintain the entire thing. A complex product also requires more resources to test and build.

Upstream dependencies. If the product is a relatively thin veneer on a handful of well-known upstream dependencies, that’s good. It means you might be able to depend on some of those upstream communities for help. However if the system is complex and has a lot of smaller dependencies, it means the surviving community has a lot to learn and the upstream communities may lack the resources to be of much help.

Downstream dependencies. If the software is used by many other projects downstream, that’s a good thing because it significantly increases the number of people interested in keeping the project alive.

Company diaspora. How much turnover has there been over the life of this company’s engineering department? Low turnover means there will be fewer people in the market who know the deep dark secrets of the code base.

Community makeup. This is closely related to the “external contributions” aspect. If the community is mostly “power users” that’s a problem for survivability. What you need is a community that has a significant number of individuals who are passionate and technical enough to dedicate time and resources to moving the software forward. It obviously helps if the software is strategic to their employers.

You might assess your favorite commercial open source software and realize that it isn’t viable without commercial backing. That doesn’t mean it’s a bad company or a bad product. What it does mean depends on your perspective:

If you depend on the software for your business…

  • And you are evaluating the software for purchase, you cannot give “availability of source code” a much higher score against other alternatives because one of the benefits you derive from that–the ability to continue forward should the company go away–is less likely to be realized. (see “Why Clients Choose Open Source”).
  • And you already have the software installed, have a backup plan in mind should something happen to the vendor, just like you would a proprietary vendor. If you deem a major event to be likely, consider moving to an alternative now.

If you are part of the commercial open source software project’s community…

  • Push for changes in the aspects identified above that would make the project more viable as an independent project. Realize, however, that not all vendors are enlightened (or empowered) enough to execute these changes.
  • Be okay with the fact that you are dedicating time and energy to something that depends mostly (or entirely) on its commercial backing. This means not getting bent out of shape when the commercial company does something to further its commercial interests. Or don’t be okay with it and move on.
  • Assess your zombie attack survivability as a community and identify initiatives that address areas of weakness.

If you are the commercial open source software vendor…

  • Be careful how you market your open source-ness. If your software fails the zombie survivability test but you tout open source as a major advantage, your marketing message may ring flat.
  • Be open to ideas that could give your software the ability to outlive you. This makes it more attractive to customers and community members alike. It makes your open source claim more genuine.

Obviously the specific threat of a zombie attack is far-fetched. But the risk of a commercial open source company being acquired, folding, or radically shifting their business model is quite real. Just because “commercial open source” has “open source” in it, does not magically give it a vibrant community that can fulfill the vital engineering and quality assurance role that the commercial company often provides. Even a vibrant community may lack the proper make-up to step in should it need to.

Regardless of which role you play, if you are a stakeholder in a commercial open source software product, it is important you assess this risk and have a plan should the need arise.

Posted in Open Source | Tagged , , | 3 Comments

How my teen-aged son became a Mozilla contributor

Jeff and Justin zip-lining during Mozilla Work Week in Whistler

Jeff and Justin zip-lining during Mozilla Work Week in Whistler

I’m in Orlando for Mozilla Work Week, which is an event that happens a couple of times a year aimed at bringing the organization’s far-flung employees and contributors together to collaborate on projects.

But I’m not here because of my own code contributions–I’m here as a chaperone. My 17 year-old son, Justin, earned his third trip to Work Week as an invited guest by dedicating time and code towards Mozilla’s mission of a free and open web.

Mozilla has thousands of contributors worldwide. Only a handful get invited to Work Week, so this is a Proud Open Source Dad Moment® for me, but I thought I’d share his story in the off chance that it motivates you or your kids to participate.

A few years ago Justin asked how he could get more involved in an open source project. He had just finished Google Code-In, which matches up High School students with open source projects. It’s a cool program, but it only lasts a few months and the projects he contributed to were somewhat esoteric. He was looking for two things in a project: it needed to mean something to him personally so that it would hold his interest and it needed to offer a technical ramp that would help grow his skills.

I told him to take a look at Mozilla. My own experience contributing to Mozilla was that they were good at getting contributors aligned with projects based on their skills and interests (check out their signup page).

At the time, Justin was new to coding. I had taught him Python but he wasn’t feeling confident enough to put those skills to use on a public project. So he started reviewing and editing technical documentation. This was a perfect place to start because it offered a chance to learn more about the culture of the organization and its tools and processes while also being exposed to the dizzying array of products, acronyms, and technologies in play at Mozilla.

(For more on teaching kids to code, see “Kids these days: Learning to code then and now“).

Then one day I walked into Justin’s room. He was hammering away at his laptop. “What are you working on?”, I asked. Without looking up he replied, “I’m writing some automated test scripts in Python using Selenium.” He was clearly “in the zone” so I just said, “Oh, cool,” and headed back down the hall, but I was pretty pumped–my kid was committing code.

Justin had found his way onto the Web Quality Assurance team. They make sure all of Mozilla’s sites are running correctly. This led to a cool father-son moment. I had been contributing some code to Mozillians, which is one of the sites Justin’s Web QA team was responsible for. Justin came to me with a problem: He could improve the efficiency of his tests if a small change was made to one of the site’s pages. He was reluctant to make the change but he knew I could do it quickly, so I did, and we ended up spending an afternoon hanging out in his room, squashing bugs and testing.

As his confidence and skills grew, so did his influence and responsibilities within Mozilla. He became a “vouched” Mozillian and, later, was added to some of the internal systems reserved for trusted contributors, and was ultimately granted the ability to commit code directly to Web QA’s projects. He participated on as many of the team’s online meetings as his school schedule would allow, began mentoring other contributors, and was given his own project to run.

This gradual increase in trust and recognition that occurs over time as a contributor spends time with a project is a key part of what makes open source work. He and I had discussed “open source” as a concept a lot, but it didn’t really sink in until he became part of it, and it has been wonderful to watch.

Something that’s been particularly instructive is how Mozilla treats its contributors. During a family vacation to San Francisco we toured both of Mozilla’s offices. Every person we were introduced to stopped what they were doing and thanked us. We felt special and appreciated.

This attention extends to code contributions. When I made my first contribution I worked with a mentor of sorts who helped me understand the workflow and coding standards for the project. If he didn’t hear from me in a while he’d shoot me a note to see if I still had time to work on the project. His continued interest made me feel like my contributions mattered, even if they were small.

Justin’s mentor went even further, offering career advice and making personal introductions. He’s even been helping Justin find internship opportunities.

When I think about everything his Mozilla experience has given him, I’m amazed. Of course he’s been apply to apply his coding skills in real world applications. But there is also a set of important technical skills that are hard to teach in a classroom, like how to be productive in a multi-developer git workflow. Perhaps more crucial are the softer skills, like how to give constructive feedback to teammates or how to pitch an idea. And the cool thing is that while he’s soaking up all of this, he’s helping further Mozilla’s mission, which is something both of us believe in.

If you know someone who believes in a free and open web, and they have the time and interest to get involved and to stick with it, encourage them to signup and get involved. There’s plenty to do for all skill levels and interests.

Posted in Open Source | Tagged , , , , , | 17 Comments

Back to the Future of Content Repositories

mcflyFive years ago I wrote a blog post called, “Alfresco, NoSQL, and the Future of ECM“. Today is “Back to the Future Day”–the exact date Marty McFly time-traveled to in the movie Back to the Future. The movie made many observations about what life would be like on October 21, 2015–some were spot on, some not so much. I thought it fitting that we take a look at my old blog post and see where we are now with regard to content repositories and NoSQL.

One point of the post was that NOSQL might be a more fitting back-end for content repositories than relational:

But why shouldn’t the Content Management tier benefit from the scalability and replication capabilities of a NOSQL repository? And why can’t a NOSQL repository have an end-user focused user interface with integrated workflow, a form service, and other traditional DM/CMS/WCM functionality? It should, it can and they will.

This has definitely turned out to be the case. New content management solution vendors like CloudCMS have built their platform on NOSQL technology while older vendors, such as Nuxeo, have started to integrate NOSQL into their solutions.

Open source projects are also taking advantage of the technology. Apache Jackrabbit provides an implementation of the JCR standard. Its “next generation” offering, Jackrabbit Oak, is essentially JCR with MongoDB as the back-end.The second point of the post was that as NOSQL repositories become more widely adopted, they compete directly with content repositories in use cases where those content repositories are used primarily as a back-end for developers’ custom content-centric solutions.

In other words, 5 or 10 years ago, if you were a developer looking to implement a custom application, and you wanted something other than a relational back-end, you might build your application on top of something like Alfresco. Now developers may be less likely to go that route. That’s because today there is an explosion of stacks out there. Many of them assume a NOSQL back-end. Look at mean.io as just one example, which combines Node.js, Express (the Node.js web framework), AngularJS, and MongoDB and wraps it up with time-saving tooling.

Many people use Alfresco as a back-end. Their front-end uses a RESTful API implemented as web scripts to talk to the repository. The value the repository brings to the table is the ability to store documents in a hierarchy along with custom metadata defined in a content model. They may not be using Alfresco Share or much of the other functionality that Alfresco bundles with their offering–for their custom solution Alfresco is just a repository. When it is used like this, Alfresco is doing nothing more than what NOSQL repositories offer, and, in fact, it does less because NOSQL repositories have a more flexible schema and are built to be clustered and massively distributable–for free.

Years ago, Alfresco shifted its focus away from developers looking to build custom solutions on top of a bare repository. Its developer outreach is now more about customizing Alfresco Share and the underlying repository. Nuxeo, on the other hand, has doubled-down on its developer focus. I’ll spend some more time on this in a future post.

I guess this trend wasn’t terribly hard to predict five years ago, but it does feel kinda nice to see it come to pass. Now, if I could just have a hoverboard.

Posted in Alfresco, Apache Jackrabbit, Content Management, Content-as-a-Service | 8 Comments

10 Things to Consider When Planning Your Elasticsearch Project

elastic_logo_color_horizontalI am seeing a lot of interest in Elasticsearch from clients and colleagues. Elasticsearch is an open source search engine that is commercially supported by a company called Elastic. It’s used for web search, log analysis, and big data analytics. You’ll often see it compared with Apache Solr. Both depend on Apache Lucene for low-level indexing and analysis. People like Elasticsearch because it is easy to install, scales out to hundreds of nodes with no additional software needed, and is easy to work with thanks to its built-in RESTful API.

Multiple folks have asked me what they need to think about when leveraging Elasticsearch as part of their solution, so I thought I’d summarize those thoughts and share them here. This isn’t a detailed technical list but is more like a set of buckets that need time and attention.

1. Cluster sizing

The nice thing about Elasticsearch is how easy it is to scale out. But you should still have an idea of the near- and medium term hardware footprint. Indexing and querying time can vary depending on many factors and every installation is different. You’ll want to establish your “unit of scale” early so you know roughly what you’ll have to do to get a target level of throughput and CPU utilization.

I wrote a blog post on using Apache JMeter to load-test Elasticsearch, which works well for establishing how much load your cluster can take and where the bottlenecks are.

2. Cluster footprint

Related to cluster sizing is your cluster footprint. Elasticsearch nodes can be master nodes, data nodes, client nodes, or some combination. Most people opt for dedicated master nodes (3 at a minimum) and then some number of data and client nodes.

I like using dedicated nodes for everything because it separates responsibilities and lets you optimize each type of node for its particular workload. For example, I’ve seen a performance boost by separating client and data nodes. The client nodes handle the incoming HTTP requests which leaves the data nodes to service the queries.

Like sizing, the footprint that works well for you depends on what you’re doing, so use something like JMeter to test repeatedly until you get it right.

3. Security

You’ll need to secure your Elasticsearch cluster, both between the application/API and Elasticsearch layers and between the Elasticsearch layer and your internal network. Shield, which is a paid product from Elastic, can take you a lot of the way here and if you pay for support from Elastic, Shield is included.

One of my projects uses Shield to provide LDAP authentication, to encrypt all data between Elasticsearch nodes with SSL, and to control authorization for all of the indices in the cluster. We’ve been happy with it so far.

If you can’t justify a support subscription you’ll need to do something else to prevent unauthorized access to your cluster. Using something like nginx as a proxy is a common choice.

4. Index/Alias/Type Mapping approach

You might call this your data partitioning and data modeling approach. You should figure out early what your approach to indices and aliases will be. You’ll definitely want to use aliases–that’s a given. Aliases insulate your app from index name changes among other things. But some thought also needs to be given to how you partition data across indices.

You’ll also need to identify how you’ll leverage type mappings. Elasticsearch is schema-less but type mappings of some kind are almost always needed so that Elasticsearch knows how to index the data (longs versus dates versus strings, for example).

I’m building a dynamic content service on top of Elasticsearch for one of my clients. They have many different types of content that will be indexed in Elasticsearch and returned to their e-commerce app as JSON chunks. A lot of time is going into defining the JSON structure for those types which ultimately gets translated into type mappings.

It is worth spending some time looking at index templates, default mappings, and dynamic mappings and thinking about how you will manage your mappings as the number of types grows.

5. Query approach & relevance tuning

The Elasticsearch query DSL is vast. At a high level you will deal with queries and filters depending on exactly what you need to do. You’ll want to avoid queries, if possible, and lean toward filters. They are much, much more performant.

More than just query design, you’ll want to figure out how you’re going to expose queries to the API. On a recent project we started by having our Java-based API layer translate developer-friendly query string params into Elasticsearch filters. We didn’t stick with that, though, because tuning and tweaking our queries required the API layer to be re-compiled and deployed. We now do everything with search templates which pulls our query logic out of the Java code and makes it easier to manage.

Understanding how to write efficient queries is one thing, but making them return the results that end-users expect is another. Once written, expect to spend some time tweaking analyzers and scoring so that the engine returns the right hits. If this is a particular concern for you take a look at the Relevant Search book from Manning.

6. Monitoring & Alerting

Be sure to factor in a completely separate “monitoring” cluster that will only be used to capture stats about the health of the cluster and alert you when something goes wrong. Two tools that work great for this are Marvel and Watcher.

Marvel keeps track of the health of the cluster and Sense (built-in to Marvel) is used to run ad hoc operations against the cluster. Marvel includes a dashboard that reports on the health of the cluster.

Elastic just released a new tool called Watcher. It watches for certain conditions and alerts you when those conditions are met. So when some stat (JVM heap, for example) reaches a threshold you can take some action (send an email, call a web hook, etc.).

Watcher isn’t just for monitoring the health of the cluster. Watcher can monitor searches against any index. In fact, Watcher can invoke any HTTP end point and then take action based on what comes back.

7. Node provisioning and config management

Once you have more than a handful of nodes it becomes challenging to keep every node in sync with regard to software versions, configuration, etc. There are a number of open source tools that can help with this. I’ve used both Chef and Ansible to help manage Elasticsearch clusters. By far, my favorite tool for this is Ansible. It automates upgrades and configuration propagation without requiring any additional software to be installed on any of the Elasticsearch nodes.

You may not see a huge need for automation now, but if you’re going to start small and grow, you’ll want to be able to grow quickly. Having a library of common tasks scripted with Ansible will allow you to go from bare server to fully-provisioned Elasticsearch node in minutes with no manual intervention.

In addition to automating installs and config changes, you’ll have a need for scheduling routine administrative tasks like copying an index or cleaning up old indices that Marvel and Watcher create daily. I use a “job server” that I built from open source components to do this. Cron jobs are also a common approach.

8. Backup and recovery

Properly tuned, indexing can run pretty fast even for very large data sets. So some people opt to simply re-index if they lose data. Elasticsearch has built-in “snapshot” functionality that can back up your indices. If you do something to handle scheduled operations (see “job server”, above) then taking regular snapshots easy to do. Relying on OS-level file system backups may be dicey once you have multiple nodes due to how the data is stored.

9. API & UI development

It is likely that you’ll put Elasticsearch behind an API layer that provides an agnostic API to applications that are leveraging your search cluster. You may also want to do some transformation of input or output before and after requests hit Elasticsearch. Exactly what you use for this is up to you–there are Elasticsearch clients for most popular languages and you can always just use the REST endpoints if needed. I’ve implemented this layer using Node.js, Java, and Lua and they each have pluses and minuses, as usual.

Everything you index into Elasticsearch is JSON. So any tool that can speak HTTP and post JSON can be used to work with the server. Elastic offers a tool called Marvel that embeds another tool called Sense (also available as a Chrome extension) that is extremely useful for doing this. Of course command-line tools like curl also work well.

Depending on the makeup of your team and the use case, you may find that writing a custom UI to manage the documents indexed into Elasticsearch, rather than using Sense or curl, is the way to go. This is just a web development task, thanks to the Elasticsearch REST API, but obviously it takes time that needs to be accounted for in your plans.

10. Data indexing

It is easy to index data into Elasticsearch. Depending on the data source and other factors, you might write this yourself or you can use another tool from Elastic called Logstash. Logstash can watch log files or other inputs and then efficiently index the data into your cluster.

Summary

Installing and running an out-of-the-box Elasticsearch cluster is easy. Making it work for your exact use case and keeping everything humming along takes a bit more effort. Hopefully this list has given you a rough idea of the areas where you’ll likely need to spend time as you move forward with your Elasticsearch project.

Posted in Elasticsearch | Tagged , | 7 Comments

Alfresco cancels Summit, asks community to organize its own conference

summit-community-editionEarlier this week, in a post to a public mailing list, Ole Hejlskov, Developer Evangelist at Alfresco, announced that the company will not be putting on its annual conference, Alfresco Summit, this year as originally planned. Instead, the company is focusing on smaller, shorter, sales-oriented events which have been very successful in several cities around the globe.

Ole said that Alfresco will be adding developer content to its Alfresco Day events, which have historically been mostly end-user and decision-maker focused. In contrast, Alfresco’s yearly events started out as developer-focused conferences, but in recent years had a more balanced agenda with both technical and non-technical tracks.

Alfresco had announced earlier in the year that their annual conference would be in New Orleans in November. In each of the last five years the company put on two conferences–one in Europe and the other in United States. For 2015 the plan was to have a single conference only in the U.S. which drew criticism from the community that skews heavily toward a non-U.S. demographic.

When the community realized Alfresco Summit 2015 would be held only in the U.S., an independent community organization called The Order of the Bee began making plans to hold their own conference in Europe. Alfresco says it will support the community’s efforts to hold its own event and wants to explore “…ways in which participation from Alfresco corporate makes sense”.

I understand where Alfresco is coming from. Annual conferences are expensive in both real dollars and the time and attention it takes to plan and execute. When you multiply that times two it obviously represents an even bigger investment.

You also have to look at what Alfresco gets out of the conference. Alfresco is increasingly sales-focused. The conference has historically been focused on knowledge-sharing and camaraderie. Yes, there were deals closed at Alfresco Summit but it was not geared towards selling. It was more about coming together to share stories, good and bad.

The Alfresco Day events are unabashedly sales and marketing. The attendees (and they get very large turnouts) know this which means Alfresco does not have to apologize for coming off too sales-y. Multiple cities with hundreds of prospects is a better investment for them than two cities with 1400 attendees who are existing customers and community members.

As the guy who led DevCon and Alfresco Summit and together with my team grew it year after year, it is weird to see Alfresco cancel the conference for 2015. I was looking forward to attending.

As a member of The Order of the Bee, I’m intrigued by the challenge of using an all-volunteer organization to potentially put together a replacement conference of some sort. If you have any interest in helping and you did not see my email to the mailing list, we’ll probably be meeting next week to get organized. Reach out to me and I’ll add you to the invitation.

Posted in Alfresco, Alfresco Community, Alfresco DevCon, Alfresco Summit | Tagged , , | 25 Comments

The plain truth about Alfresco’s open source ethos

There was a small flare-up on the Order of the Bee list this week. It started when someone suggested that the Community Edition (CE) versus Enterprise Edition comparison page on alfresco.com put CE in a negative light. In full disclosure, I collaborated with Marketing on that page when I worked for Alfresco. My goal at the time was to make sure that the comparison was fair and that it didn’t disparage Community Edition. I think it still passes that test and is similar to the comparison pages of other commercial open source companies.

My response to the original post to the list was that people shouldn’t bother trying to get the page changed. Why? Because how Alfresco Software, Inc. chooses to market their software is out-of-scope for the community. As long as the commercial company behind Alfresco doesn’t say anything untrue about Community Edition, the community shouldn’t care.

The fact that there is a commercial company behind Alfresco, that they are in the business of selling Enterprise support subscriptions, and at the same time have a vested interest in promoting the use of Community Edition to certain market segments is something you have to get your head around.

Actually there are a handful of things that you really need to understand and accept so you can be a happy member of the community. Here they are:

1. CE is distributed under LGPLv3 so it is open source.

If you need to put a label on it and you are a binary type of person, this is at the top of the list. Alfresco is “open source” because it is distributed under an OSI-approved license. A more fine-grained description is that it is “open core” because the same software is distributed under two different licenses, with the enterprise version being based on the free version and including features not available in the free version.

2. Committers will only ever be employees.

There have been various efforts over the years to get the community more involved in making direct code contributions. The most recent is that Aikau is on github and accepting pull requests. Maybe some day the core repository will be donated to Apache or some other foundation. Until then, if you want to commit directly to core, send a resume to Alfresco Software, Inc. I know they are hiring talented engineers.

3. Alfresco Software, Inc. is a commercial, for-profit business.

Already mentioned, but worth repeating: The company behind the software earns revenue from support subscriptions, and, increasingly, value-added features not available in the open source distribution. The company is going to do everything it can to maximize revenue. The community needs this to be the case because a portion of those resources support the community product. The company needs the community, so it won’t do anything to aggressively undermine adoption of the free product. You have to believe this to be true. A certain amount of trust is required for a symbiotic relationship to work.

4. “Open source” is not a guiding principle for the company.

Individuals within the company are ardent open source advocates and passionate and valued community members, but the organization as a whole does not use “open source” as a fundamental guiding principle. This should not be surprising when you consider that:

  1. “Drive Open Innovation” not “Open Source” is a core value to the company as publicly expressed on the Our Values page.
  2. The leadership team has no open source experience (except John Newton and PHH whose open source experience is Alfresco and Activiti).
  3. The community team doesn’t exist any more–the company has shifted to a “developer engagement” strategy rather than having a dedicated community leadership or advocacy team.

Accept the fact that this is a software company like any other, that distributes some of its software under an open source license and employs many talented people who spend a lot of their time (on- and off-hours) to further the efforts of the community. It is not a “everything-we-do-we-do-because-open-source” kind of company. It just isn’t.

5. Alfresco originally released under an open source license primarily as a go-to-market strategy

In the early days, open source was attractive to the company not because it wanted help building the software, but because the license undermined the position of proprietary vendors and because they hoped to gain market share quickly by leveraging the viral nature of freely-distributable software. Being open was an attractive (and highly marketable) contrast to the extremely closed and proprietary nature of legacy ECM vendors such as EMC and Microsoft.

I think John and Paul also hoped that the open and transparent nature of open source would lend itself to developer adoption, third-party integrations and add-ons, and a partner ecosystem, which it did.

I think it is this last one–the mismatch between the original motivations to release as open source and what we as a community expect from an open source project–that causes angst. The “open source” moniker attracts people who wish the project was more like an organic open source project than it can or ever will be.

For me, personally, I accepted these as givens a long time ago–none of them bother me any more. I am taking this gift that we’ve been given–a highly-functional, freely-distributable ECM platform–and I’m using it to help people. I’m no longer interested in holding the company to a dogmatic standard they never intended to be held to.

So be cool and do your thing

The “commercial” part of “commercial open source” creates a tension that is felt both internally and externally. Internal tension happens when decisions have to be made for the benefit of one side at the expense of the other. External tension happens when the community feels like the company isn’t always acting in their best interest and lacks the context or visibility needed to believe otherwise.

This tension is a natural by-product of the commercial open source model. It will always be there. Let’s acknowledge it, but I see no reason to antagonize it.

If you want to help the community around Alfresco, participate. Build something. Install the software and help others get it up and running. Join the Order of the Bee. If you want to help Alfresco with its marketing, send them your resume.

Posted in Alfresco, Alfresco Community, General | Tagged , , , | 39 Comments

When to consider Cloud CMS for your content management project

Cloud CMS LogoCloud CMS announced today that it has added support for CMIS. This is a nice addition for all sorts of reasons, but near the top from Cloud CMS’s perspective is that it makes it easier to migrate content from existing solutions into the Cloud CMS repository.

Back in November I did a series of reviews on content-as-a-service providers. One of my posts was on Cloud CMS. The post assumes you are looking for hosted content-as-a-service and shows how Cloud CMS compares to other cloud offerings.

What I think we’re going start seeing more and more, however, are people who might consider Cloud CMS as an alternative to traditional on-premises ECM vendors like Alfresco, Nuxeo, Documentum, and Microsoft. Although Cloud CMS was originally built to be a hosted, content-centric, back-end for mobile and web applications, it can just as easily function as your hosted intranet or document management repository.

With custom content models, event triggers, and custom workflows, you may find that the only difference between Cloud CMS and your current on-premises document repository is that you don’t have to worry about software or hardware installation and upgrades any longer.

Considering Cloud CMS as an alternative to traditional players may make sense when:

  1. A 100% cloud-native solution is preferred. While Cloud CMS could be run on-premises it is certainly built to be hosted on your behalf. Plus, one of the benefits of letting Cloud CMS run, upgrade, and scale your repository is so you don’t have to.
  2. Customization is important. Some of the traditional vendors have made cloud add-ons for their products, but they then lock down the content model and the user interface so that it cannot be customized. Cloud CMS offers the benefits of hassle-free operations while maintaining your ability to customize it to meet your exact requirements.
  3. Budget is constrained. Clients who need enterprise-grade features and the peace of mind that a support contract brings but can’t justify the high cost of a traditional vendor’s enterprise license may find Cloud CMS to be a lower-cost alternative. Rather than licensing by the seat or the server, Cloud CMS cost is based on how and how much you use the system.

Clients who have very straightforward needs (simple file sync and share, for example) will probably choose something a little more utilitarian, like Google Drive, Box, Dropbox, or Amazon Zocalo. And, despite Cloud CMS recently having undergone an extensive security audit, I know some clients may still be reluctant to move to the cloud. Everyone else, though, should take a hard look at Cloud CMS.

 

Posted in Content Management | Tagged , , , | 7 Comments

Alfresco tutorials updated to SDK 2.0 and Alfresco 5.0

I’ve recently updated the Alfresco Developer Series tutorials to work with version 2.0 of the Alfresco SDK and Alfresco 5.0.d (and Enterprise 5.0).

Note that the SDK is not backwards compatible. If you are running Alfresco 4.x you need to use the older version of the SDK. When you move to 5.0 you need to move to SDK 2.0. The steps to do that are roughly:

1. Merge your pom.xml with the one generated by the 2.0 archetype.
2. Copy/merge tomcat/context.xml.
3. Copy run.sh from a 2.0 project into yours. Only needed if you are using spring-loaded.
4. Copy/merge src/test/resources.
5. Copy/merge src/test/properties.

That part was easy for all of the tutorial projects. The time-consuming part was just updating the screenshots and a few of the steps. The code stayed the same across all projects.

If you are still on 4.x and you want to use the tutorials that are specific to the older version, just use the source tagged with “4.x” on github.

Posted in Alfresco, Alfresco Tutorials | Tagged | 17 Comments