
Clean up your Elasticsearch query logic with search templates

Photo credit: Marcin Wichary

I stumbled onto Elasticsearch’s search templates feature on my last project and it turned out to be really useful. I remember being surprised I hadn’t seen it mentioned anywhere. I’ve asked around at the last couple of meetups I’ve attended and it turns out many people don’t know about search templates, so I thought it might make a good post.

I’m going to give some context as to why this feature was useful, then I’ll show you how to use it. If you don’t want or need the context, feel free to skip to the next section.

Context: Real world use case for search templates

For this particular project we were using Elasticsearch as a content service. A set of front-end Single Page Applications (SPAs) query the content service. The content service returns content objects as JSON that match some criteria. Components in the SPAs format the objects as needed.

The content service is a Java-based API that sits between the SPAs and Elasticsearch. The API abstracts the Elasticsearch details and adds some business logic regarding which content to return beyond simply matching the parameters specified by the front-end.

A simple example of one type of business logic the API adds is publish date and expiration date handling. All of our JSON objects in Elasticsearch have a publish date and some have an expiration date. When the front-end asks the API for content, we only want to return content that is current; in other words, the current date has to be greater than or equal to the publish date and, if an expiration date is set, less than the expiration date.

If you leave this up to the front-end, and if the API is open, then anyone can get any content object regardless of whether it is currently effective, which, in our case, is a bad thing. Even if the API were locked down, there’s no reason to make each front-end application duplicate the date handling logic. So the API handles that and other business rules around fetching content and constructing a response.

The native Elasticsearch API is Java, so building and executing queries in Java is a very natural thing to do. However, as the service evolved, the part of our code responsible for constructing the query was at risk of becoming unwieldy. We also started to identify new types of queries the front-end needed to execute that didn’t fit cleanly into our existing query-building logic.

In addition to identifying new types of queries the API needed to support, we began to see that the front-end applications would need to provide more than just a flat list of key-value pairs; at the very least, they would need to ask for content with parameters that included arrays and dictionaries as well.

The service had reached a point where it needed flexibility in the number and type of queries it could run and the parameters it could accept, but we didn’t want to expose the full power (and complexity) of the Elasticsearch Query DSL. Search templates to the rescue.

What is an Elasticsearch Search Template?

(This section contains embedded gists. If you can’t see them, you may need to enable JavaScript. If all else fails, the gists live here.)

An Elasticsearch search template is kind of like a stored procedure in a relational database. Really, it’s just a normal query with replacement variables, aka template parameters. The templates are expressed using Mustache.

Here’s a simple example:
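
If the embedded gist doesn’t render for you, an inline search template request looks roughly like this (the tweets index and message field are just stand-ins for whatever you’re searching):

GET /tweets/_search/template
{
  "template": {
    "query": {
      "match": {
        "message": "{{query_string}}"
      }
    }
  },
  "params": {
    "query_string": "search templates"
  }
}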

That example specified the template and the parameters in the same request. Obviously, if you’re going to do that, you might as well not use a template.

What you’d rather do is put the template somewhere and then invoke it. You have two options. You can index the template or you can put the template on the file system.

Here’s how you index a template:
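
Roughly, it is a POST to the _search/template endpoint with the id you want to give the template (I’m using “tweets” here to match the examples that follow):

POST /_search/template/tweets
{
  "template": {
    "query": {
      "match": {
        "message": "{{query_string}}"
      }
    }
  }
}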

And then you can call it like this:
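
Something like this, referencing the stored template by its id and passing in the parameters:

GET /tweets/_search/template
{
  "template": {
    "id": "tweets"
  },
  "params": {
    "query_string": "search templates"
  }
}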

If you’d rather put the template on the file system, it goes in $ES_HOME/config/scripts and is named template-id.mustache. Once you’ve deployed the template to every node in your cluster, you can call it like this:
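
The call is the same shape, except that you reference the template file by name (minus the .mustache extension), roughly:

GET /tweets/_search/template
{
  "template": {
    "file": "tweets"
  },
  "params": {
    "query_string": "search templates"
  }
}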

You don’t have to restart the node when you update a search template. Elasticsearch picks up the changes automatically. If you watch the log when you update a search template you should see something like:

[2016-01-14 18:02:39,797][INFO ][script] [node01] compiling script file [/opt/elasticsearch/jtpcluster01/config/scripts/tweets.mustache]

Including conditionals in your search template

Suppose we want to return all tweets unless a “since” parameter is provided. If since is specified, the query should do a date range against the timestamp property using the value provided. Mustache has some support for conditionals. Here’s how it looks in a search template:
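
Here is a sketch of what such a template might look like (I’m using a filtered query and a timestamp field as examples; your field names will differ):

{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      }{{#since}},
      "filter": {
        "range": {
          "timestamp": {
            "gte": "{{since}}"
          }
        }
      }{{/since}}
    }
  }
}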

This template will conditionally add the date range check only if the “since” parameter is provided.

Note: Be careful of spacing here. I like to put a space between my curly braces and the parameter, but if you do that in the conditional, Mustache won’t recognize it.

To get the tweets for the last 30 days, you’d call the search template like this:
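
For example, using Elasticsearch date math for the since value (and assuming the file-based tweets template sketched above):

GET /tweets/_search/template
{
  "template": {
    "file": "tweets"
  },
  "params": {
    "since": "now-30d"
  }
}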

And to get all of the tweets you’d just omit the since parameter.

As your templates get more complex you might take a look at this tool. It allows you to quickly see how your templated queries will render given a set of parameters.

Using negation to implement if-then-else logic

Suppose that instead of returning all tweets we want to return just the last day of tweets unless the since parameter is specified. You’d like an if-then-else in the Mustache template. Else isn’t directly supported by Mustache, but we can use negation (an inverted section) to achieve the same thing.

This template keeps the clause that does the date check if since is specified, but now adds a default date check if it is not:
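
A sketch of that version of the template, with {{^since}} acting as the “else” branch and defaulting the range to the last day:

{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "range": {
          "timestamp": {
            {{#since}}
            "gte": "{{since}}"
            {{/since}}
            {{^since}}
            "gte": "now-1d"
            {{/since}}
          }
        }
      }
    }
  }
}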

If the date is specified in the since parameter, it works as it did before. If not, only the last day of tweets will be returned.

Working with Arrays

One thing that is kind of annoying is handling arrays. You can iterate over an array with Mustache fairly easily, but Mustache doesn’t have a mechanism for checking a position in an array, such as “isLast” or “hasNext”, so if you need to do something like that, you’ll end up creating your own construct.

For example, suppose we want to be able to pass in a list of user names to the search template to restrict the list of tweets to those specific users. The easy way to handle that in our query is to use a terms filter, like this:

{
  "terms": {
    "user": ["jeffpotts01", "elastic"]
  }
}

But that doesn’t let me show how to work with arrays, so I’m going to contrive the example a bit: if a user list is provided, we need to add an “or” clause to the query with one term filter per user name.

To do that, we’ll require the list of users to be provided to the search template like this:
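
One way to structure those parameters (the exact shape is up to you, as long as the template agrees with it):

GET /tweets/_search/template
{
  "template": {
    "file": "tweets"
  },
  "params": {
    "userList": {
      "users": [
        { "name": "jeffpotts01", "isLast": false },
        { "name": "elastic", "isLast": true }
      ]
    }
  }
}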

The template can check for the “userList” parameter to know whether or not to build the “or” clause. Then it can iterate over the “users” array, plucking out the name.

As the template iterates over the array, it needs to know whether or not it is on the last user. Otherwise it has no way of knowing whether or not to add the comma separating the term filters. Mustache can’t help us here, so the search parameters will include “isLast” set to true for the last user in the list.

Here’s a template that can handle the array of user names:
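
Something along these lines (again using a filtered query as the skeleton; the interesting parts are the {{#users}} loop and the {{^isLast}} comma handling):

{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      }{{#userList}},
      "filter": {
        "or": [
          {{#users}}
          { "term": { "user": "{{name}}" } }{{^isLast}},{{/isLast}}
          {{/users}}
        ]
      }{{/userList}}
    }
  }
}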

The result of calling the template above with the example user list is a query that looks like this:
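
Roughly this, whitespace aside:

{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "or": [
          { "term": { "user": "jeffpotts01" } },
          { "term": { "user": "elastic" } }
        ]
      }
    }
  }
}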

With those simple constructs you ought to be able to create some very elaborate search templates.

Invoking search templates from the Java API

Back in the service layer, it is easy to invoke a search template with the Java API. Here’s how that looks:
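
A minimal sketch, assuming the 1.x-era transport client; the setTemplateName/setTemplateType/setTemplateParams builder methods shown here were the relevant ones at the time, so adjust for whatever client version you’re running:

import java.util.Map;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.script.ScriptService;

public class TweetSearchService {

    private final Client client;

    public TweetSearchService(Client client) {
        this.client = client;
    }

    // Invokes the "tweets" search template, passing along whatever parameters the caller provides.
    public SearchResponse findTweets(Map<String, Object> params) {
        return client.prepareSearch("tweets")
                .setTemplateName("tweets")                       // template id, or file name minus .mustache
                .setTemplateType(ScriptService.ScriptType.FILE)  // FILE for config/scripts templates, INDEXED for stored ones
                .setTemplateParams(params)
                .execute()
                .actionGet();
    }
}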

In the real API, those params are getting POSTed to the endpoint.

Query changes without a code deployment

With search templates, we can add new queries and modify existing queries by creating and modifying search templates. This means for many adjustments, we don’t have to build and deploy the custom content service API code. And troubleshooting is easier too because we can invoke the same search template the service is using directly and not worry about whether or not the Java API is building the query we expect.

So the next time you find yourself writing code to construct an Elasticsearch query, ask yourself if it would make more sense to externalize it as a search template.

10 Things to Consider When Planning Your Elasticsearch Project

I am seeing a lot of interest in Elasticsearch from clients and colleagues. Elasticsearch is an open source search engine that is commercially supported by a company called Elastic. It’s used for web search, log analysis, and big data analytics. You’ll often see it compared with Apache Solr. Both depend on Apache Lucene for low-level indexing and analysis. People like Elasticsearch because it is easy to install, scales out to hundreds of nodes with no additional software needed, and is easy to work with thanks to its built-in RESTful API.

Multiple folks have asked me what they need to think about when leveraging Elasticsearch as part of their solution, so I thought I’d summarize those thoughts and share them here. This isn’t a detailed technical list but is more like a set of buckets that need time and attention.

1. Cluster sizing

The nice thing about Elasticsearch is how easy it is to scale out. But you should still have an idea of your near- and medium-term hardware footprint. Indexing and querying time can vary depending on many factors, and every installation is different. You’ll want to establish your “unit of scale” early so you know roughly what it will take to hit a target level of throughput and CPU utilization.

I wrote a blog post on using Apache JMeter to load-test Elasticsearch, which works well for establishing how much load your cluster can take and where the bottlenecks are.

2. Cluster footprint

Related to cluster sizing is your cluster footprint. Elasticsearch nodes can be master nodes, data nodes, client nodes, or some combination. Most people opt for dedicated master nodes (3 at a minimum) and then some number of data and client nodes.

I like using dedicated nodes for everything because it separates responsibilities and lets you optimize each type of node for its particular workload. For example, I’ve seen a performance boost by separating client and data nodes. The client nodes handle the incoming HTTP requests which leaves the data nodes to service the queries.
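
For reference, node roles in the 1.x/2.x line are controlled by a couple of flags in elasticsearch.yml, along these lines:

# Dedicated master-eligible node
node.master: true
node.data: false

# Dedicated data node
node.master: false
node.data: true

# Client node (no master duties, holds no data, just routes requests)
node.master: false
node.data: false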

Like sizing, the footprint that works well for you depends on what you’re doing, so use something like JMeter to test repeatedly until you get it right.

3. Security

You’ll need to secure your Elasticsearch cluster, both between the application/API and Elasticsearch layers and between the Elasticsearch layer and your internal network. Shield, which is a paid product from Elastic, can take you a lot of the way here and if you pay for support from Elastic, Shield is included.

One of my projects uses Shield to provide LDAP authentication, to encrypt all data between Elasticsearch nodes with SSL, and to control authorization for all of the indices in the cluster. We’ve been happy with it so far.

If you can’t justify a support subscription you’ll need to do something else to prevent unauthorized access to your cluster. Using something like nginx as a proxy is a common choice.
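
A minimal sketch of the nginx approach, with basic auth in front of a node listening on localhost (the port, realm, and password file location are just examples):

server {
    listen 9280;

    location / {
        auth_basic           "Elasticsearch";
        auth_basic_user_file /etc/nginx/es-htpasswd;
        proxy_pass           http://127.0.0.1:9200;
    }
}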

4. Index/Alias/Type Mapping approach

You might call this your data partitioning and data modeling approach. You should figure out early what your approach to indices and aliases will be. You’ll definitely want to use aliases; that’s a given. Aliases insulate your app from index name changes, among other things. But some thought also needs to be given to how you partition data across indices.
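
For example, pointing a stable alias at a versioned index looks something like this (the index and alias names are made up):

POST /_aliases
{
  "actions": [
    { "add": { "index": "content_v1", "alias": "content" } }
  ]
}

Later you can reindex into content_v2 and flip the alias without touching the application.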

You’ll also need to identify how you’ll leverage type mappings. Elasticsearch is schema-less but type mappings of some kind are almost always needed so that Elasticsearch knows how to index the data (longs versus dates versus strings, for example).

I’m building a dynamic content service on top of Elasticsearch for one of my clients. They have many different types of content that will be indexed in Elasticsearch and returned to their e-commerce app as JSON chunks. A lot of time is going into defining the JSON structure for those types which ultimately gets translated into type mappings.

It is worth spending some time looking at index templates, default mappings, and dynamic mappings and thinking about how you will manage your mappings as the number of types grows.
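
As an illustration, an index template of that era can apply mappings to any index matching a pattern (the names and fields here are made up):

PUT /_template/content
{
  "template": "content-*",
  "mappings": {
    "article": {
      "properties": {
        "title":       { "type": "string" },
        "publishDate": { "type": "date" }
      }
    }
  }
}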

5. Query approach & relevance tuning

The Elasticsearch query DSL is vast. At a high level you will deal with queries and filters, depending on exactly what you need to do. You’ll want to avoid queries where possible and lean toward filters; filters can be cached and skip relevance scoring, which makes them much more performant.
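
As an illustration, here is a date restriction expressed as a filter inside a 1.x-style filtered query (in 2.x the bool query’s filter clause plays the same role); publishDate is just an example field:

{
  "query": {
    "filtered": {
      "query": {
        "match": { "title": "elasticsearch" }
      },
      "filter": {
        "range": {
          "publishDate": { "lte": "now" }
        }
      }
    }
  }
}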

More than just query design, you’ll want to figure out how you’re going to expose queries to the API. On a recent project we started by having our Java-based API layer translate developer-friendly query string params into Elasticsearch filters. We didn’t stick with that, though, because tuning and tweaking our queries required the API layer to be re-compiled and deployed. We now do everything with search templates which pulls our query logic out of the Java code and makes it easier to manage.

Understanding how to write efficient queries is one thing, but making them return the results end-users expect is another. Once your queries are written, expect to spend some time tweaking analyzers and scoring so that the engine returns the right hits. If this is a particular concern for you, take a look at the Relevant Search book from Manning.

6. Monitoring & Alerting

Be sure to factor in a completely separate “monitoring” cluster that will only be used to capture stats about the health of the cluster and alert you when something goes wrong. Two tools that work great for this are Marvel and Watcher.

Marvel keeps track of the health of the cluster and includes a dashboard that reports on it; Sense (built in to Marvel) is used to run ad hoc operations against the cluster.

Elastic just released a new tool called Watcher. It watches for certain conditions and alerts you when those conditions are met. So when some stat (JVM heap, for example) reaches a threshold you can take some action (send an email, call a web hook, etc.).

Watcher isn’t just for monitoring the health of the cluster. Watcher can monitor searches against any index. In fact, Watcher can invoke any HTTP end point and then take action based on what comes back.

7. Node provisioning and config management

Once you have more than a handful of nodes it becomes challenging to keep every node in sync with regard to software versions, configuration, etc. There are a number of open source tools that can help with this. I’ve used both Chef and Ansible to help manage Elasticsearch clusters. By far, my favorite tool for this is Ansible. It automates upgrades and configuration propagation without requiring any additional software to be installed on any of the Elasticsearch nodes.

You may not see a huge need for automation now, but if you’re going to start small and grow, you’ll want to be able to grow quickly. Having a library of common tasks scripted with Ansible will allow you to go from bare server to fully-provisioned Elasticsearch node in minutes with no manual intervention.

In addition to automating installs and config changes, you’ll have a need for scheduling routine administrative tasks like copying an index or cleaning up old indices that Marvel and Watcher create daily. I use a “job server” that I built from open source components to do this. Cron jobs are also a common approach.

8. Backup and recovery

Properly tuned, indexing can run pretty fast even for very large data sets, so some people opt to simply re-index if they lose data. Elasticsearch also has built-in “snapshot” functionality that can back up your indices. If you have something to handle scheduled operations (see “job server”, above), then taking regular snapshots is easy to do. Relying on OS-level file system backups may be dicey once you have multiple nodes due to how the data is stored.
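
Registering a filesystem repository and taking a snapshot looks roughly like this (the repository name and location are examples; the location has to be a shared filesystem that every node can see):

PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/es-backups"
  }
}

PUT /_snapshot/my_backup/snapshot_20160114?wait_for_completion=true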

9. API & UI development

It is likely that you’ll put Elasticsearch behind an API layer that provides an agnostic API to applications that are leveraging your search cluster. You may also want to do some transformation of input or output before and after requests hit Elasticsearch. Exactly what you use for this is up to you–there are Elasticsearch clients for most popular languages and you can always just use the REST endpoints if needed. I’ve implemented this layer using Node.js, Java, and Lua and they each have pluses and minuses, as usual.

Everything you index into Elasticsearch is JSON. So any tool that can speak HTTP and post JSON can be used to work with the server. Elastic offers a tool called Marvel that embeds another tool called Sense (also available as a Chrome extension), which is extremely useful for this. Of course, command-line tools like curl also work well.

Depending on the makeup of your team and the use case, you may find that writing a custom UI to manage the documents indexed into Elasticsearch, rather than using Sense or curl, is the way to go. This is just a web development task, thanks to the Elasticsearch REST API, but obviously it takes time that needs to be accounted for in your plans.

10. Data indexing

It is easy to index data into Elasticsearch. Depending on the data source and other factors, you might write this yourself or you can use another tool from Elastic called Logstash. Logstash can watch log files or other inputs and then efficiently index the data into your cluster.
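
A bare-bones Logstash config for that pattern looks something like this (Logstash 2.x syntax; the log path is a placeholder):

input {
  file {
    path => "/var/log/myapp/*.log"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}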

Summary

Installing and running an out-of-the-box Elasticsearch cluster is easy. Making it work for your exact use case and keeping everything humming along takes a bit more effort. Hopefully this list has given you a rough idea of the areas where you’ll likely need to spend time as you move forward with your Elasticsearch project.

Keeping your Alfresco web scripts DRY

One of my teammates thoroughly drenched our under-the-desk development server with a giant cup of coffee once. Somehow, disaster was avoided, although the client’s carpet had a nice coffee-stain outline of the server’s footprint long after the app rolled out which afforded the rest of us endless opportunities to mercilessly haze the clumsy coder.

Keeping your Alfresco web scripts DRY is actually not about making your developers use spill-proof travel mugs, or better yet, virtual machines. DRY is a coding principle that stands for Don’t Repeat Yourself. There are all kinds of resources available that go into more detail, and the principle is broader than simply avoiding duplicate code, but that’s what I want to focus on here.

There are three techniques you should be using to avoid repeating yourself when writing web scripts: web script configuration, JavaScript libraries accessed via import, and FreeMarker macros accessed via import.

Web Script Configuration

Added in 3.0, a web script configuration file is an XML file that contains arbitrary settings for your web script. It’s accessible from both your controller and your view. Building on the hello world web script from the Alfresco Developer Guide, you could add a configuration file named “helloworld.get.config.xml” that contains:


<properties>
<title>Hello World</title>
</properties>

You could then access the “title” element from a JavaScript controller using the built-in E4X library:

var s = new XML(config.script);
logger.log(s.title);

And, you could also grab the title from the FreeMarker view:

<title>${config.script["properties"]["title"]}</title>

The web script configuration lets you separate your configuration from your controller logic. And because you can get to it from both the controller and the view, you don’t have to stick configuration info into your model.

This example showed a script-specific configuration, but global configuration is also possible. See the Alfresco Wiki Page on Web Scripts for more details.

JavaScript Import

If your web script controllers are written in JavaScript, at some point you will find yourself writing JavaScript functions that you’d like to share across multiple web scripts. A common example is logic that builds a query string, executes the query, and returns the results. You don’t want to repeat the code that does that across multiple controllers. That’s what the import statement is for.

The syntax is easy:

<import resource="classpath:alfresco/extension/scripts/status.js">

This needs to be the first line of your JavaScript controller file. This example shows a JavaScript library being imported from the classpath, but you can also import from the repository by name or by node reference. See the Alfresco JavaScript API Wiki Page for more details.

FreeMarker Import

Of the three, this is the one I see ignored most often. Let’s take the Optaros-developed microblogging component for Share. Its basic data entity is called a “Status” object. So web scripts on the repository tier return JSON that might have a single Status object or a list of Status objects. That’s two different web scripts and two different views, but the difference between a list of objects and a single object is really just the list-related wrapper; in both cases, the individual Status object JSON is identical. I’ve seen people simply copy-and-paste the FreeMarker from the single-object template into the list template and then wrap it with the list markup. That’s bad. If you ever change how a Status object is structured, you’ve got to change it in (at least) two places. (Or it’ll be someone who comes after you that has to do it, which makes it even worse. If you aren’t following the analogy, your redundant code is the coffee stain.)

Instead of duplicating the logic, use a FreeMarker import and define a macro that formats the object. Then you can call the macro any time you need your object formatted. Here’s how it works for Status.

A FreeMarker file called “status.lib.ftl” contains the macros that format Status objects. It lives with the rest of the web script files and looks like this:


<#assign datetimeformat="EEE, dd MMM yyyy HH:mm:ss zzz">
<#--
Renders a status node as a JSON object
-->
<#macro statusJSON status>
<#escape x as jsonUtils.encodeJSONString(x)>
{
"siteId" : "${status.properties["optStatus:siteId"]!''}",
"user" : "${status.properties["optStatus:user"]!''}",
"message" : "${status.properties["optStatus:message"]!''}",
"prefix" : "${status.properties["optStatus:prefix"]!''}",
"mood" : "${status.properties["optStatus:mood"]!''}",
"complete" : ${(status.properties["optStatus:complete"]!'false')?string},
"created" : "${status.properties["cm:created"]?string(datetimeformat)}",
"modified" : "${status.properties["cm:modified"]?string(datetimeformat)}"
}
</#escape>
</#macro>
<#--
Renders a status node as HTML
-->
<#macro statusHTML status>
SiteID: ${status.properties["optStatus:siteId"]!''}<br />
User: ${status.properties["optStatus:user"]!''}<br />
Message: ${status.properties["optStatus:message"]!''}<br />
Prefix: ${status.properties["optStatus:prefix"]!''}<br />
Mood: ${status.properties["optStatus:mood"]!''}<br />
Complete: ${(status.properties["optStatus:complete"]!'false')?string}<br />
Created: ${status.properties["cm:created"]?string(datetimeformat)}<br />
Modified: ${status.properties["cm:modified"]?string(datetimeformat)}<br />
</#macro>

There’s one macro for JSON and another for HTML. It still bothers me that the same basic structure is repeated, but at least they are both in the same file and it feels better knowing that this is the one and only place where the data structure for a Status object is defined.

The FreeMarker that returns Status objects as JSON resides in status.get.json.ftl and looks like this:

<#import "status.lib.ftl" as statusLib/>
{
"items" : [
<#list results as result>
<@statusLib.statusJSON status=result />
<#if result_has_next>,</#if>
</#list>
]
}

You can see the import statement followed by the JSON that sets up the list, and then the call to the FreeMarker macro to format each Status object in the list.

When someone posts a new Status, the new Status object is returned. Rather than repeat the JSON structure for a Status object, the status.post.json.ftl view simply calls the same macro that got called from the GET:

<#import "status.lib.ftl" as statusLib/>
{
"status" : <@statusLib.statusJSON status=result />
}

Now if the data structure for a Status object ever needs to change, it only has to be changed in one place.

Don’t Repeat Yourself

Take a look at your web scripts. Eliminate your duplicate code. And keep a lid on your mocha frappuccino.

Productively running multiple versions of Alfresco

How do you juggle multiple versions of Alfresco on your local machine? At any given time, I’ve got at least three versions running: The latest Enterprise release, the Enterprise head, and the Community head. If one or more active Optaros clients aren’t yet on the latest release, those older releases are running as well.

There’s always more than one way to skin a cat, but here’s how I handle this problem. First, I want to use a single Tomcat instance, but I don’t want all of my Alfresco versions running at the same time. Second, I want to use the same MySQL instance running with version-specific databases to keep them separate. Third, I want separate, version-specific data directories.

The solution is to name your databases and all of your version-specific directories with a naming convention, and then use shell scripts to delete and create soft links in the appropriate locations that point to the version-specific directories. (If you are running Windows you can do symbolic links too; they are called junctions.)

In an Alfresco install there are four directories that you need to address:

  • The data directory
  • The web application directory
  • The shared extension directory
  • The virtualization server directory

Decide on a naming convention. For my databases, I use alfrescoXXn where XX is the version number and n is the network abbreviation (“c” for community, “e” for enterprise, etc.). So my database for 2.2 Enterprise would be alfresco22e. For directories, I use alfresco-X.X-nnnn. So, for example, the directories for Alfresco 2.2 Enterprise would be:

  • /home/jpotts/alfresco/alfresco-2.2-enterprise/data
  • /home/jpotts/alfresco/alfresco-2.2-enterprise/webapp
  • /home/jpotts/alfresco/alfresco-2.2-enterprise/extension
  • /home/jpotts/alfresco/alfresco-2.2-enterprise/virtual

When I want to switch between versions, I just call a shell script with the version as the argument. It takes care of deleting the existing links and creating the new ones:

#!/bin/sh
# Switches the active Alfresco version. Expects the version suffix (for example, 2.2-enterprise)
# as $1 and assumes TOMCAT_HOME and VIRTUAL_TOMCAT_HOME are set in the environment.
echo Switching to $1...
echo Switching webapp...
rm $TOMCAT_HOME/webapps/alfresco
ln -s /home/jpotts/alfresco/alfresco-$1/webapp $TOMCAT_HOME/webapps/alfresco
echo Switching extension directory...
rm $TOMCAT_HOME/shared/classes/alfresco/extension
ln -s /home/jpotts/alfresco/alfresco-$1/extension $TOMCAT_HOME/shared/classes/alfresco/extension
echo Clearing Tomcat work directory...
rm -rf $TOMCAT_HOME/work/Catalina/localhost/alfresco
echo Clearing Tomcat temp directory...
rm -rf $TOMCAT_HOME/temp/*
echo Switching virtual tomcat directory...
rm $VIRTUAL_TOMCAT_HOME
ln -s /home/jpotts/alfresco/alfresco-$1/virtual $VIRTUAL_TOMCAT_HOME

Note that in this set up, the data directory and the Alfresco database don’t have to be dealt with directly because they are referenced by the custom-repository.properties file residing in the version-specific extension directory.

While you’re at it, why not make it easier to clear out the repository? Set up a “clean” script that cleans out your data directory and then runs the db_remove and db_setup SQL scripts. Remember that you’ll either need version-specific copies of these scripts or you’ll need to modify them to accept the version-specific database as an argument.
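
Here is a minimal sketch of such a clean script, assuming MySQL and the naming convention above (the script locations and credentials are placeholders):

#!/bin/sh
# Usage: ./clean.sh 2.2-enterprise
# Assumes version-specific copies of the Alfresco db_remove/db_setup scripts
# live under the version directory, per the note above.
echo Cleaning data directory for $1...
rm -rf /home/jpotts/alfresco/alfresco-$1/data/*
echo Rebuilding the database...
mysql -u root -p < /home/jpotts/alfresco/alfresco-$1/sql/db_remove.sql
mysql -u root -p < /home/jpotts/alfresco/alfresco-$1/sql/db_setup.sql
echo Done. Restart Tomcat to rebuild the repository.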

With this setup in place, you can set up new versions, switch between versions, and start over with a clean repo quickly and easily.