I am seeing a lot of interest in Elasticsearch from clients and colleagues. Elasticsearch is an open source search engine that is commercially supported by a company called Elastic. It’s used for web search, log analysis, and big data analytics. You’ll often see it compared with Apache Solr. Both depend on Apache Lucene for low-level indexing and analysis. People like Elasticsearch because it is easy to install, scales out to hundreds of nodes with no additional software needed, and is easy to work with thanks to its built-in RESTful API.
Multiple folks have asked me what they need to think about when leveraging Elasticsearch as part of their solution, so I thought I’d summarize those thoughts and share them here. This isn’t a detailed technical list but is more like a set of buckets that need time and attention.
1. Cluster sizing
The nice thing about Elasticsearch is how easy it is to scale out. But you should still have an idea of the near- and medium term hardware footprint. Indexing and querying time can vary depending on many factors and every installation is different. You’ll want to establish your “unit of scale” early so you know roughly what you’ll have to do to get a target level of throughput and CPU utilization.
I wrote a blog post on using Apache JMeter to load-test Elasticsearch, which works well for establishing how much load your cluster can take and where the bottlenecks are.
2. Cluster footprint
Related to cluster sizing is your cluster footprint. Elasticsearch nodes can be master nodes, data nodes, client nodes, or some combination. Most people opt for dedicated master nodes (3 at a minimum) and then some number of data and client nodes.
I like using dedicated nodes for everything because it separates responsibilities and lets you optimize each type of node for its particular workload. For example, I’ve seen a performance boost by separating client and data nodes. The client nodes handle the incoming HTTP requests which leaves the data nodes to service the queries.
Like sizing, the footprint that works well for you depends on what you’re doing, so use something like JMeter to test repeatedly until you get it right.
You’ll need to secure your Elasticsearch cluster, both between the application/API and Elasticsearch layers and between the Elasticsearch layer and your internal network. Shield, which is a paid product from Elastic, can take you a lot of the way here and if you pay for support from Elastic, Shield is included.
One of my projects uses Shield to provide LDAP authentication, to encrypt all data between Elasticsearch nodes with SSL, and to control authorization for all of the indices in the cluster. We’ve been happy with it so far.
If you can’t justify a support subscription you’ll need to do something else to prevent unauthorized access to your cluster. Using something like nginx as a proxy is a common choice.
4. Index/Alias/Type Mapping approach
You might call this your data partitioning and data modeling approach. You should figure out early what your approach to indices and aliases will be. You’ll definitely want to use aliases–that’s a given. Aliases insulate your app from index name changes among other things. But some thought also needs to be given to how you partition data across indices.
You’ll also need to identify how you’ll leverage type mappings. Elasticsearch is schema-less but type mappings of some kind are almost always needed so that Elasticsearch knows how to index the data (longs versus dates versus strings, for example).
I’m building a dynamic content service on top of Elasticsearch for one of my clients. They have many different types of content that will be indexed in Elasticsearch and returned to their e-commerce app as JSON chunks. A lot of time is going into defining the JSON structure for those types which ultimately gets translated into type mappings.
5. Query approach & relevance tuning
The Elasticsearch query DSL is vast. At a high level you will deal with queries and filters depending on exactly what you need to do. You’ll want to avoid queries, if possible, and lean toward filters. They are much, much more performant.
More than just query design, you’ll want to figure out how you’re going to expose queries to the API. On a recent project we started by having our Java-based API layer translate developer-friendly query string params into Elasticsearch filters. We didn’t stick with that, though, because tuning and tweaking our queries required the API layer to be re-compiled and deployed. We now do everything with search templates which pulls our query logic out of the Java code and makes it easier to manage.
Understanding how to write efficient queries is one thing, but making them return the results that end-users expect is another. Once written, expect to spend some time tweaking analyzers and scoring so that the engine returns the right hits. If this is a particular concern for you take a look at the Relevant Search book from Manning.
6. Monitoring & Alerting
Be sure to factor in a completely separate “monitoring” cluster that will only be used to capture stats about the health of the cluster and alert you when something goes wrong. Two tools that work great for this are Marvel and Watcher.
Marvel keeps track of the health of the cluster and Sense (built-in to Marvel) is used to run ad hoc operations against the cluster. Marvel includes a dashboard that reports on the health of the cluster.
Elastic just released a new tool called Watcher. It watches for certain conditions and alerts you when those conditions are met. So when some stat (JVM heap, for example) reaches a threshold you can take some action (send an email, call a web hook, etc.).
Watcher isn’t just for monitoring the health of the cluster. Watcher can monitor searches against any index. In fact, Watcher can invoke any HTTP end point and then take action based on what comes back.
7. Node provisioning and config management
Once you have more than a handful of nodes it becomes challenging to keep every node in sync with regard to software versions, configuration, etc. There are a number of open source tools that can help with this. I’ve used both Chef and Ansible to help manage Elasticsearch clusters. By far, my favorite tool for this is Ansible. It automates upgrades and configuration propagation without requiring any additional software to be installed on any of the Elasticsearch nodes.
You may not see a huge need for automation now, but if you’re going to start small and grow, you’ll want to be able to grow quickly. Having a library of common tasks scripted with Ansible will allow you to go from bare server to fully-provisioned Elasticsearch node in minutes with no manual intervention.
In addition to automating installs and config changes, you’ll have a need for scheduling routine administrative tasks like copying an index or cleaning up old indices that Marvel and Watcher create daily. I use a “job server” that I built from open source components to do this. Cron jobs are also a common approach.
8. Backup and recovery
Properly tuned, indexing can run pretty fast even for very large data sets. So some people opt to simply re-index if they lose data. Elasticsearch has built-in “snapshot” functionality that can back up your indices. If you do something to handle scheduled operations (see “job server”, above) then taking regular snapshots easy to do. Relying on OS-level file system backups may be dicey once you have multiple nodes due to how the data is stored.
9. API & UI development
It is likely that you’ll put Elasticsearch behind an API layer that provides an agnostic API to applications that are leveraging your search cluster. You may also want to do some transformation of input or output before and after requests hit Elasticsearch. Exactly what you use for this is up to you–there are Elasticsearch clients for most popular languages and you can always just use the REST endpoints if needed. I’ve implemented this layer using Node.js, Java, and Lua and they each have pluses and minuses, as usual.
Everything you index into Elasticsearch is JSON. So any tool that can speak HTTP and post JSON can be used to work with the server. Elastic offers a tool called Marvel that embeds another tool called Sense (also available as a Chrome extension) that is extremely useful for doing this. Of course command-line tools like curl also work well.
Depending on the makeup of your team and the use case, you may find that writing a custom UI to manage the documents indexed into Elasticsearch, rather than using Sense or curl, is the way to go. This is just a web development task, thanks to the Elasticsearch REST API, but obviously it takes time that needs to be accounted for in your plans.
10. Data indexing
It is easy to index data into Elasticsearch. Depending on the data source and other factors, you might write this yourself or you can use another tool from Elastic called Logstash. Logstash can watch log files or other inputs and then efficiently index the data into your cluster.
Installing and running an out-of-the-box Elasticsearch cluster is easy. Making it work for your exact use case and keeping everything humming along takes a bit more effort. Hopefully this list has given you a rough idea of the areas where you’ll likely need to spend time as you move forward with your Elasticsearch project.