Month: July 2015

10 Things to Consider When Planning Your Elasticsearch Project

I am seeing a lot of interest in Elasticsearch from clients and colleagues. Elasticsearch is an open source search engine that is commercially supported by a company called Elastic. It’s used for web search, log analysis, and big data analytics. You’ll often see it compared with Apache Solr. Both depend on Apache Lucene for low-level indexing and analysis. People like Elasticsearch because it is easy to install, scales out to hundreds of nodes with no additional software needed, and is easy to work with thanks to its built-in RESTful API.

Multiple folks have asked me what they need to think about when leveraging Elasticsearch as part of their solution, so I thought I’d summarize those thoughts and share them here. This isn’t a detailed technical list but is more like a set of buckets that need time and attention.

1. Cluster sizing

The nice thing about Elasticsearch is how easy it is to scale out. But you should still have an idea of the near- and medium-term hardware footprint. Indexing and querying time can vary depending on many factors, and every installation is different. You’ll want to establish your “unit of scale” early so you know roughly what you’ll have to do to get a target level of throughput and CPU utilization.

I wrote a blog post on using Apache JMeter to load-test Elasticsearch, which works well for establishing how much load your cluster can take and where the bottlenecks are.
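
If you just want a quick sanity check before building out a full JMeter test plan, even a small script can give you rough latency numbers. Here is a minimal Python sketch (not a substitute for real load testing); the host and index name are placeholders:

    import time
    import requests

    ES = "http://localhost:9200"            # assumed local node
    QUERY = {"query": {"match_all": {}}}    # stand-in for a real query

    timings = []
    for _ in range(100):
        start = time.time()
        r = requests.post(ES + "/content/_search", json=QUERY)  # "content" is a made-up index
        r.raise_for_status()
        timings.append(time.time() - start)

    timings.sort()
    print("median: %.1f ms, p95: %.1f ms" % (
        timings[len(timings) // 2] * 1000,
        timings[int(len(timings) * 0.95)] * 1000))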

2. Cluster footprint

Related to cluster sizing is your cluster footprint. Elasticsearch nodes can be master nodes, data nodes, client nodes, or some combination. Most people opt for dedicated master nodes (3 at a minimum) and then some number of data and client nodes.

I like using dedicated nodes for everything because it separates responsibilities and lets you optimize each type of node for its particular workload. For example, I’ve seen a performance boost from separating client and data nodes. The client nodes handle the incoming HTTP requests, which leaves the data nodes free to service the queries.

Like sizing, the footprint that works well for you depends on what you’re doing, so use something like JMeter to test repeatedly until you get it right.
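
For what it’s worth, node roles come down to a couple of settings in elasticsearch.yml. This is the 1.x-era syntax; treat it as a sketch and check the docs for your version:

    # elasticsearch.yml -- role settings per node type

    # Dedicated master: eligible to be elected master, holds no data
    node.master: true
    node.data: false

    # Dedicated data node
    # node.master: false
    # node.data: true

    # Client node: never master, holds no data, just routes requests
    # node.master: false
    # node.data: false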

3. Security

You’ll need to secure your Elasticsearch cluster, both between the application/API and Elasticsearch layers and between the Elasticsearch layer and your internal network. Shield, a paid product from Elastic, can take you a lot of the way here, and if you pay for support from Elastic, Shield is included.

One of my projects uses Shield to provide LDAP authentication, to encrypt all data between Elasticsearch nodes with SSL, and to control authorization for all of the indices in the cluster. We’ve been happy with it so far.

If you can’t justify a support subscription you’ll need to do something else to prevent unauthorized access to your cluster. Using something like nginx as a proxy is a common choice.
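
As a rough sketch, an nginx proxy with basic auth in front of a node might look like the following. The hostname, certificate paths, and htpasswd file are all assumptions, and a production config needs more thought (timeouts, allowed methods, and so on):

    server {
        listen 443 ssl;
        server_name search.example.com;

        ssl_certificate     /etc/nginx/ssl/search.crt;
        ssl_certificate_key /etc/nginx/ssl/search.key;

        location / {
            auth_basic           "Elasticsearch";
            auth_basic_user_file /etc/nginx/es.htpasswd;
            proxy_pass           http://127.0.0.1:9200;
        }
    }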

4. Index/Alias/Type Mapping approach

You might call this your data partitioning and data modeling approach. You should figure out early what your approach to indices and aliases will be. You’ll definitely want to use aliases–that’s a given. Aliases insulate your app from index name changes, among other things. But some thought also needs to be given to how you partition data across indices.
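
The payoff is that repointing an alias from an old index to a new one is a single atomic call. A minimal Python sketch against the _aliases endpoint (index and alias names are made up):

    import requests

    # Swap the "content" alias from v1 to v2 in one atomic operation
    actions = {
        "actions": [
            {"remove": {"index": "content_v1", "alias": "content"}},
            {"add":    {"index": "content_v2", "alias": "content"}},
        ]
    }
    requests.post("http://localhost:9200/_aliases", json=actions).raise_for_status()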

You’ll also need to identify how you’ll leverage type mappings. Elasticsearch is schema-less but type mappings of some kind are almost always needed so that Elasticsearch knows how to index the data (longs versus dates versus strings, for example).
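
A simple explicit mapping looks like this (1.x-era syntax with the string type; the index, type, and field names are hypothetical):

    import requests

    # Tell Elasticsearch how to index each field of the "article" type
    mapping = {
        "article": {
            "properties": {
                "title":     {"type": "string"},
                "published": {"type": "date"},
                "views":     {"type": "long"},
            }
        }
    }
    requests.put("http://localhost:9200/content_v2/_mapping/article",
                 json=mapping).raise_for_status()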

I’m building a dynamic content service on top of Elasticsearch for one of my clients. They have many different types of content that will be indexed in Elasticsearch and returned to their e-commerce app as JSON chunks. A lot of time is going into defining the JSON structure for those types which ultimately gets translated into type mappings.

It is worth spending some time looking at index templates, default mappings, and dynamic mappings and thinking about how you will manage your mappings as the number of types grows.
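
As a hedged example, an index template applies settings and mappings to any new index whose name matches a pattern, and a _default_ mapping with dynamic templates keeps per-type boilerplate down as the number of types grows. The names and settings here are invented:

    import requests

    template = {
        "template": "content-*",             # applies to any new matching index
        "settings": {"number_of_shards": 3},
        "mappings": {
            "_default_": {
                # Index new string fields without analysis by default
                "dynamic_templates": [{
                    "strings_not_analyzed": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "string", "index": "not_analyzed"},
                    }
                }]
            }
        },
    }
    requests.put("http://localhost:9200/_template/content_defaults",
                 json=template).raise_for_status()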

5. Query approach & relevance tuning

The Elasticsearch query DSL is vast. At a high level you will deal with queries and filters, depending on exactly what you need to do. Where possible, avoid queries and lean toward filters: filters skip relevance scoring and can be cached, which makes them much, much more performant.
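
In the 1.x query DSL that combination is expressed as a “filtered” query: score on the full-text part, filter on the exact-match part. A sketch with made-up field names:

    import requests

    body = {
        "query": {
            "filtered": {
                "query":  {"match": {"title": "summer sale"}},    # scored
                "filter": {"term": {"status": "published"}},      # cached, not scored
            }
        }
    }
    r = requests.post("http://localhost:9200/content/_search", json=body)
    print(r.json()["hits"]["total"])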

More than just query design, you’ll want to figure out how you’re going to expose queries to the API. On a recent project we started by having our Java-based API layer translate developer-friendly query string params into Elasticsearch filters. We didn’t stick with that, though, because tuning and tweaking our queries required the API layer to be re-compiled and deployed. We now do everything with search templates, which pulls our query logic out of the Java code and makes it easier to manage.
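
Search templates are Mustache-based: you store the query shape once and pass parameters at search time. A rough sketch using the 1.x indexed-template endpoints (the template id, fields, and params are all made up; check the docs for your version):

    import requests

    ES = "http://localhost:9200"

    # Store the template once; {{...}} placeholders are filled in at search time
    template = {
        "template": {
            "query": {
                "filtered": {
                    "query":  {"match": {"title": "{{search_text}}"}},
                    "filter": {"term": {"status": "{{status}}"}},
                }
            }
        }
    }
    requests.post(ES + "/_search/template/by_title", json=template).raise_for_status()

    # Call it by id with params; changing the query needs no application redeploy
    body = {"template": {"id": "by_title"},
            "params": {"search_text": "summer sale", "status": "published"}}
    r = requests.post(ES + "/content/_search/template", json=body)
    print(r.json()["hits"]["total"])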

Understanding how to write efficient queries is one thing, but making them return the results that end-users expect is another. Once your queries are written, expect to spend some time tweaking analyzers and scoring so that the engine returns the right hits. If this is a particular concern for you, take a look at the Relevant Search book from Manning.
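
Analyzer changes are index-level settings. Here is a sketch of a custom analyzer that lowercases and stems English text, applied to one field at index creation (all names are invented):

    import requests

    index = {
        "settings": {
            "analysis": {
                "analyzer": {
                    "english_stemmed": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "porter_stem"],
                    }
                }
            }
        },
        "mappings": {
            "article": {
                "properties": {
                    "title": {"type": "string", "analyzer": "english_stemmed"}
                }
            }
        },
    }
    requests.put("http://localhost:9200/content_v3", json=index).raise_for_status()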

6. Monitoring & Alerting

Be sure to factor in a completely separate “monitoring” cluster that will only be used to capture stats about the health of the cluster and alert you when something goes wrong. Two tools that work great for this are Marvel and Watcher.

Marvel keeps track of the health of the cluster and includes a dashboard that reports on it. Sense, which is built into Marvel, is used to run ad hoc operations against the cluster.

Elastic just released a new tool called Watcher. It watches for certain conditions and alerts you when those conditions are met. So when some stat (JVM heap, for example) reaches a threshold, you can take some action (send an email, call a webhook, etc.).

Watcher isn’t just for monitoring the health of the cluster. Watcher can monitor searches against any index. In fact, Watcher can invoke any HTTP end point and then take action based on what comes back.
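
To make that concrete, here is a sketch of a watch that runs a search every five minutes and sends an email when it finds hits. The index, query, and addresses are made up, and the exact schema may vary between Watcher releases:

    import requests

    watch = {
        "trigger": {"schedule": {"interval": "5m"}},          # run every 5 minutes
        "input": {
            "search": {
                "request": {
                    "indices": ["logs"],
                    "body": {"query": {"match": {"level": "ERROR"}}},
                }
            }
        },
        "condition": {"compare": {"ctx.payload.hits.total": {"gt": 0}}},
        "actions": {
            "notify_ops": {
                "email": {"to": "ops@example.com",
                          "subject": "Errors showing up in the logs"}
            }
        },
    }
    requests.put("http://localhost:9200/_watcher/watch/log_errors",
                 json=watch).raise_for_status()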

7. Node provisioning and config management

Once you have more than a handful of nodes it becomes challenging to keep every node in sync with regard to software versions, configuration, etc. There are a number of open source tools that can help with this. I’ve used both Chef and Ansible to help manage Elasticsearch clusters. By far, my favorite tool for this is Ansible. It automates upgrades and configuration propagation without requiring any additional software to be installed on any of the Elasticsearch nodes.

You may not see a huge need for automation now, but if you’re going to start small and grow, you’ll want to be able to grow quickly. Having a library of common tasks scripted with Ansible will allow you to go from bare server to fully-provisioned Elasticsearch node in minutes with no manual intervention.
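
A taste of what that looks like: the playbook below pushes a templated elasticsearch.yml to every node in a group and restarts the service. The group name, paths, and handler are assumptions:

    # deploy-config.yml (Ansible playbook sketch)
    - hosts: elasticsearch
      become: yes
      tasks:
        - name: Deploy Elasticsearch config
          template:
            src: templates/elasticsearch.yml.j2
            dest: /etc/elasticsearch/elasticsearch.yml
          notify: restart elasticsearch
      handlers:
        - name: restart elasticsearch
          service:
            name: elasticsearch
            state: restarted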

In addition to automating installs and config changes, you’ll need to schedule routine administrative tasks, like copying an index or cleaning up the old indices that Marvel and Watcher create daily. I use a “job server” that I built from open source components to do this. Cron jobs are also a common approach.
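
The cleanup itself is straightforward once it is scheduled. Here is a rough Python sketch that drops time-stamped Marvel indices older than two weeks; the prefix, date format, and retention window are assumptions, and Elastic’s Curator tool covers this same ground:

    import requests
    from datetime import datetime, timedelta

    ES = "http://localhost:9200"
    cutoff = datetime.utcnow() - timedelta(days=14)

    # _cat/indices?h=index returns one index name per line
    for name in requests.get(ES + "/_cat/indices?h=index").text.split():
        if not name.startswith(".marvel-"):
            continue
        try:
            day = datetime.strptime(name.split("-", 1)[1], "%Y.%m.%d")
        except ValueError:
            continue  # not a date-suffixed index; leave it alone
        if day < cutoff:
            requests.delete(ES + "/" + name).raise_for_status()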

8. Backup and recovery

Properly tuned, indexing can run pretty fast even for very large data sets, so some people opt to simply re-index if they lose data. Elasticsearch also has built-in “snapshot” functionality that can back up your indices. If you have something in place to handle scheduled operations (see “job server,” above), taking regular snapshots is easy to do. Relying on OS-level file system backups may be dicey once you have multiple nodes due to how the data is stored.
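
The snapshot API is two calls: register a repository once, then snapshot into it on whatever schedule you like. A sketch with invented names (the filesystem location must be reachable by every node, and newer 1.x releases require it to be whitelisted in path.repo):

    import requests

    ES = "http://localhost:9200"

    # One-time: register a shared-filesystem repository
    repo = {"type": "fs", "settings": {"location": "/mnt/es-backups"}}
    requests.put(ES + "/_snapshot/nightly", json=repo).raise_for_status()

    # Scheduled: snapshot one or more indices into the repository
    requests.put(
        ES + "/_snapshot/nightly/snapshot-2015.07.15",
        json={"indices": "content_v2", "ignore_unavailable": True},
    ).raise_for_status()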

9. API & UI development

It is likely that you’ll put Elasticsearch behind an API layer that provides an agnostic API to applications that are leveraging your search cluster. You may also want to do some transformation of input or output before and after requests hit Elasticsearch. Exactly what you use for this is up to you–there are Elasticsearch clients for most popular languages and you can always just use the REST endpoints if needed. I’ve implemented this layer using Node.js, Java, and Lua and they each have pluses and minuses, as usual.

Everything you index into Elasticsearch is JSON, so any tool that can speak HTTP and post JSON can be used to work with the server. Elastic offers a tool called Marvel, which embeds another tool called Sense (also available as a Chrome extension) that is extremely useful for this. Of course, command-line tools like curl also work well.

Depending on the makeup of your team and the use case, you may find that writing a custom UI to manage the documents indexed into Elasticsearch, rather than using Sense or curl, is the way to go. This is just a web development task, thanks to the Elasticsearch REST API, but obviously it takes time that needs to be accounted for in your plans.

10. Data indexing

It is easy to index data into Elasticsearch. Depending on the data source and other factors, you might write this yourself or you can use another tool from Elastic called Logstash. Logstash can watch log files or other inputs and then efficiently index the data into your cluster.
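
A minimal Logstash pipeline has an input, optional filters, and an output. This sketch tails application logs and indexes each parsed line; the path and grok pattern are assumptions, and the elasticsearch output options vary by Logstash version:

    input {
      file {
        path => "/var/log/app/*.log"
        start_position => "beginning"
      }
    }
    filter {
      grok {
        match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
      }
    }
    output {
      elasticsearch {
        host => "localhost"   # "host" in Logstash 1.x; "hosts" in later versions
      }
    }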

Summary

Installing and running an out-of-the-box Elasticsearch cluster is easy. Making it work for your exact use case and keeping everything humming along takes a bit more effort. Hopefully this list has given you a rough idea of the areas where you’ll likely need to spend time as you move forward with your Elasticsearch project.

Alfresco cancels Summit, asks community to organize its own conference

Earlier this week, in a post to a public mailing list, Ole Hejlskov, Developer Evangelist at Alfresco, announced that the company will not be putting on its annual conference, Alfresco Summit, this year as originally planned. Instead, the company is focusing on smaller, shorter, sales-oriented events, which have been very successful in several cities around the globe.

Ole said that Alfresco will be adding developer content to its Alfresco Day events, which have historically been mostly end-user and decision-maker focused. In contrast, Alfresco’s yearly events started out as developer-focused conferences, but in recent years had a more balanced agenda with both technical and non-technical tracks.

Alfresco had announced earlier in the year that their annual conference would be in New Orleans in November. In each of the last five years the company put on two conferences–one in Europe and the other in the United States. For 2015 the plan was to hold a single conference in the U.S. only, which drew criticism from a community that skews heavily toward a non-U.S. demographic.

When the community realized Alfresco Summit 2015 would be held only in the U.S., an independent community organization called The Order of the Bee began making plans to hold their own conference in Europe. Alfresco says it will support the community’s efforts to hold its own event and wants to explore “…ways in which participation from Alfresco corporate makes sense”.

I understand where Alfresco is coming from. Annual conferences are expensive, both in real dollars and in the time and attention it takes to plan and execute them. When you multiply that by two, it obviously represents an even bigger investment.

You also have to look at what Alfresco gets out of the conference. Alfresco is increasingly sales-focused, while the conference has historically been focused on knowledge-sharing and camaraderie. Yes, there were deals closed at Alfresco Summit, but it was not geared toward selling. It was more about coming together to share stories, good and bad.

The Alfresco Day events are unabashedly about sales and marketing. The attendees (and they get very large turnouts) know this, which means Alfresco does not have to apologize for coming off too sales-y. Multiple cities with hundreds of prospects each is a better investment for them than two cities with 1400 attendees who are existing customers and community members.

As the guy who led DevCon and Alfresco Summit and, together with my team, grew the event year after year, I find it strange to see Alfresco cancel the conference for 2015. I was looking forward to attending.

As a member of The Order of the Bee, I’m intrigued by the challenge of using an all-volunteer organization to potentially put together a replacement conference of some sort. If you have any interest in helping and you did not see my email to the mailing list, we’ll probably be meeting next week to get organized. Reach out to me and I’ll add you to the invitation.