Have you checked out the Apache Solr project yet? It’s pretty cool. It’s essentially a search server (deployed as a web app into a servlet container) that sits on top of Lucene. Solr makes it super easy to get content into and out of Lucene via its HTTP and XML APIs.
Recently, for a prospective Optaros client, we put together a little demo to show how Alfresco WCM could integrate with Solr to provide search and personalization for a web site managed within Alfresco. Here’s what we did at a high level:
- Created an Alfresco web form and XSLT for the web content as usual.
- Created an additional XSLT (or Freemarker) template to convert the XML content to the Solr format. This gets configured as an additional presentation template associated with the web form.
- Wrote a JSP to aggregate the Solr XML for all of the published content.
- Wrote a servlet to call the JSP every X seconds. It takes the response and posts it to Solr. That’s how the Alfresco content gets into the index.
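To make the last two steps a bit more concrete, here is a minimal sketch of what "the Solr format" and the post look like. It is not the demo's actual code: the class name, the field names, and the update URL are all assumptions for illustration. The only parts taken from Solr itself are the `<add><doc><field>` message shape and the fact that updates are posted over HTTP.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper; names are illustrative, not from the demo.
class SolrIndexer {

    // Escape the characters that are unsafe inside XML text and attributes.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Build a Solr <add> message containing one document.
    static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(escapeXml(e.getKey())).append("\">")
              .append(escapeXml(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    // POST an XML message to Solr's update handler and return the HTTP status.
    // The URL is an assumption, e.g. "http://localhost:8983/solr/update".
    static int postToSolr(String updateUrl, String xml) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(updateUrl).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

Newly added documents only show up in search results after a commit, so a poller along these lines would typically follow the add with a second post whose body is `<commit/>`.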
This setup allowed web content to get indexed by the Solr search engine shortly after its creation, the next time the servlet polled the JSP. Web site users (either using the web site in the virtualized sandbox or on the production web site) could then query the content.
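On the query side, a search is just an HTTP GET against Solr's select handler. As a rough sketch (the class name and base URL are assumptions, and the demo's own query code is not shown here), building the request URL could look like this:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for building a Solr query URL.
class SolrQueryUrl {

    // baseUrl is an assumption, e.g. "http://localhost:8983/solr".
    // q is the user's query string; rows caps the number of results returned.
    static String build(String baseUrl, String q, int rows) {
        return baseUrl + "/select?q="
                + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&rows=" + rows;
    }
}
```

Calling `SolrQueryUrl.build("http://localhost:8983/solr", "alfresco wcm", 10)` yields `http://localhost:8983/solr/select?q=alfresco+wcm&rows=10`; the response comes back as XML that the page can then render.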
The web site was a mix of static HTML and JSPs. The JSPs used custom taglibs to call “Solr Search” widgets in the right spot on the page. This was the first time I had used Alfresco’s virtualization to run a real web application (as opposed to static content). The preview release of 2.0 I was using seemed to have some significant caching issues. Hopefully those are resolved in the production release. Other than that, it was easy to see how technical and non-technical content managers could leverage Alfresco virtualization to collaborate on developing and managing a dynamic web site.
Before using this approach in production, I would need to think about the best way to handle deletes. In the demo, once content got into the index, it stayed there even if the associated content was removed from Alfresco. As far as Solr goes, it is easy to get the content deleted from the index: it’s a simple HTTP POST. The trick is where in Alfresco to put that call.
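For reference, the Solr half of a delete could be sketched as follows. This only covers building and sending the message; it says nothing about where in Alfresco the call should live, which is still the open question. The class name and URL are assumptions, and the id is whatever value was indexed in the schema's unique key field.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative, not from the demo.
class SolrDeleter {

    // Build a Solr delete-by-id message, escaping XML-unsafe characters.
    static String toDeleteXml(String id) {
        String safe = id.replace("&", "&amp;")
                        .replace("<", "&lt;")
                        .replace(">", "&gt;");
        return "<delete><id>" + safe + "</id></delete>";
    }

    // POST the delete message to Solr's update handler and return the HTTP status.
    // The URL is an assumption, e.g. "http://localhost:8983/solr/update".
    static int delete(String updateUrl, String id) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(updateUrl).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(toDeleteXml(id).getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

As with adds, the deletion only becomes visible to searches after a `<commit/>` is posted to the same handler.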