Cluster | ECM Architect

I am way past due getting this blog post out. We’ve had multiple discussions on the topic in #alfresco on IRC over the last several weeks, but I want to make sure those of you who aren’t yet hanging out in IRC regularly are aware of some changes related to Alfresco clustering that have already happened with the 4.2.a release. In general, I would prefer to share this stuff way ahead of any release in which it takes effect, but that didn’t happen this time and I’m sorry about that. I promise to try to do better!

With that out of the way, let me shed some light on some recent changes that may affect some of you running Community Edition…

Until recently, clustering in Alfresco has been implemented using a combination of three main technologies: Ehcache, Hazelcast, and JGroups. If you’ve looked at it lately, you may have noticed that JGroups has been removed from the repository source code. What’s going on is that we are consolidating our clustering implementation on Hazelcast because it can handle everything we need to make clustering work. (On a side note, I believe this is one of those improvements that we’ve made to our on-premise software as a result of lessons learned running our own large-scale Alfresco implementation in the cloud).

So that explains what’s going on with clustering in the Enterprise Edition. But if you looked closely at the Community Edition source code, you may have noticed that Hazelcast is no longer included at all. In fact, all clustering related code (org.alfresco.repo.cluster.*) has been removed, including configuration files like cluster-context.xml and hazelcast/* and some changes to existing Spring configuration.

Now, at this point, some of you are probably thinking, “There was clustering code in Community Edition? I didn’t think clustering was supported in Community Edition.” You are correct. Clustering has never been supported in Community Edition. Community Edition is not commercially supported by Alfresco at all. But a lot of the pieces you need to make clustering work in Community Edition have been available until now, and there may be people out there who chose to get it working themselves rather than pay for an Enterprise subscription that includes support for clustering. These recent changes make it much harder to do that.

What I want to make sure everyone is clear on is that the removal of the ability to cluster Community Edition does not represent a shift in our philosophy on what should be in Community Edition versus what should be in Enterprise Edition. The principles John Newton outlined in his blog post, “Building a stronger open source product” back in 2009 still apply today. In short, functionality that supports large-scale rollouts (like clustering) or that depend on paid “Enterprise” software (like Oracle and WebSphere) should be Enterprise-only while everything else should be available to the community.

So rather than a change in philosophy, the removal of the clustering code from Community Edition simply implements the existing philosophy more explicitly. To be blunt, if your implementation is critical enough to require the 7×24 up-time an active-active, multi-node cluster provides, you should be able to justify an Enterprise Edition subscription. If you have high availability requirements but your rollout is relatively small or cost is an issue, perhaps Alfresco in the Cloud will be a fit.

Of the estimated 100,000+ installations of Community Edition currently up-and-running, my hunch is that this change affects only an infinitesimal fraction. But if you have feedback on this change, please do let your voice be heard, either here or by sending me email directly at jeff dot potts at alfresco dot com.

Optaros has worked on some of the largest and most complex Alfresco implementations anywhere. Projects where multi-node read-write clusters are required have been particularly challenging. So when Alfresco announced clustering improvements in 3.1 my interest was piqued.

I decided I’d do a simple test: Get a two-node read-write Alfresco 3.1 cluster running using a shared MySQL database and a shared file store (as opposed to a replicated database and a replicated file store). The process is mostly documented here but I thought I’d capture the steps I went through in case someone finds them helpful.

Prepare the virtual machines

If you already have virtual or physical machines ready to go, go on to “Setup the content store & database”.

I already had an Ubuntu server virtual machine image with everything I needed for the test. I upgraded it to Alfresco 3.1, cleared out the repository, and verified that everything was working okay. In order to share my data directory via NFS I did need to use apt-get to install nfs-kernel-server, nfs-common, and portmap, but that’s no big deal.

Once I had the first image all set it was time to create a second. I’m using Sun’s VirtualBox for virtualization. It doesn’t have a “clone” command in the UI and you can’t simply do a file copy of the VDI file. Instead, you have to use VBoxManage on the command line. The form of the command that uses the source VDI file name and target file name didn’t work, but using the source VDI UUID did:

BoxManage clonevdi 19a7646e-d5cb-4e01-90fd-2bcd556dc1d5 "Ubuntu Test Server Clone.vdi"

It was weird that I had to use the source UUID instead of the file name, but I got what I wanted.

Setup the network

I used VirtualBox “host only” networking for ease of setup. This allowed my host machine to see the images and the images to see each other.

My server image was originally set up to use DHCP. That appeared to be giving Alfresco and JGroups trouble so I converted the images to use static IP addresses, unique host names, and updated hosts files (I didn’t want to set up DNS). That left me with three machines (one host and two virtual machines called node1 and node2) that could ping each other by name.

Setup the content store & database

At this point I’ve got two identical Alfresco servers, but each have their own database and data store. For my test, they needed to point to the same database. They also needed to share the content store but have their own local Lucene index.

For this test I decided to use the database and file system on node1 for both nodes. In real life, that wouldn’t be a good setup because losing node1 would bring down the whole cluster. For a shared db/file system setup, you’d want separate nodes, each clustered, for the db and file system.

My Alfresco content store is in “/srv”. I wanted to use NFS to share the content store with the other nodes in my cluster, so I edited /etc/exports to add a new entry for the “/srv” directory. I used an IP address range here but I could have used explicit host names.

/srv 192.168.56.0/25(rw,no_root_squash,async)

You have to restart the nfs-kernel to make that change take effect:

/etc/init.d/nfs-kernel-server restart

Then, I split out the content and index stores into three directories:

/srv/alfresco-3.1-enterprise /srv/alfresco-3.1-enterprise-local-index /srv/alfresco-3.1-enterprise-local-index-backup

And updated custom-repository.properties accordingly:

dir.root=/srv/alfresco-3.1-enterprise dir.indexes=/srv/alfresco-3.1-enterprise-local-index dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup

The second node will access the database remotely, so MySQL needed to know about that:

grant all on alfresco31e.* to 'alfresco31e'@'192.168.56.4' identified by 'alfresco31e' with grant option;

Later it seemed that node1 was accessing MySQL via its static IP address rather than localhost as it used to. Rather than figure out why or where that’s config’d, I just ran the same command as the above for node1’s static IP.

With node1 all set, it was time to give node2 some attention…

My original plan was to NFS mount the node1 data directory as something like “/srv/alfresco-labs-3d-shared” because using the same directory name I would have used on a single node seemed confusing. As it turned out, I think Alfresco must keep track of that data directory name because it complained that my “dir.root” was set incorrectly. So I wound up using the same directory names that I used on node1 and making the same update to custom-repository.properties:

dir.root=/srv/alfresco-3.1-enterprise dir.indexes=/srv/alfresco-3.1-enterprise-local-index dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup

Then I mounted the data directory:

mount 192.168.56.3:/srv/alfresco-3.1-enterprise /srv/alfresco-3.1-enterprise

I didn’t do it, but it would be smart to update /etc/fstab so that the data directory would be automatically mounted on server startup.

With that the data directories are all set. Telling node2 to use the database on node1 instead of localhost was a simple custom-repository.properties change:

db.url=jdbc:mysql://node1.alfresco.jpotts.com/alfresco31e

Now node1 and node2 are pointing to the same content store and database, and each have their own Lucene index. The last step was to configure the cluster.

Configure the cluster

Configuring the cluster involved enabling the sample ehcluster-config.xml and making a few small changes to custom-repository.properties.

To enable the ehcluster-config, I copied the ehcluster-config.xml.sample file that came with the sample extensions to ehcluster-config.xml to my extensions directory. No other changes were needed in this particular case.

In custom-repository.properties, you have to assign a cluster name to activate the cluster. The index recovery mode needs to be set to AUTO so the indexes stay in sync:

alfresco.cluster.name=testcluster index.recovery.mode=AUTO

In Alfresco 3.1, Alfresco uses JGroups to discover and coordinate cluster members. It has configurable protocols it uses for cluster member communication. The default is set to UDP but I couldn’t get that to work, so I changed it to TCP. I also found that I had to list the hosts in my cluster in order for the two nodes to find each other:

alfresco.jgroups.defaultProtocol=TCP alfresco.tcp.initial_hosts=node1.alfresco.jpotts.com[7800],node2.alfresco.jpotts.com[7800]

As you can see, most of the work was really about networking and data setup. The cluster configuration itself is actually pretty minimal.

Test the cluster

Before starting Tomcat on the two nodes, I enabled a log4j logger so I could see nodes join and leave the cluster:

log4j.logger.org.alfresco.enterprise.repo.cache.jgroups=INFO

After starting up Tomcat, I eventually saw this in catalina.out:

06:24:52,043 INFO [repo.jgroups.AlfrescoJGroupsChannelFactory] Created JChannelFactory: Cluster Name: testcluster Stack Mapping: {DEFAULT=TCP} Configuration: file:/opt/apache/apache-tomcat-5.5.27/webapps/alfresco/WEB-INF/classes/alfresco/jgroups-default.xml

——————————————————-
GMS: address is 192.168.56.3:7800
——————————————————-

When the second node joined the cluster, the first node knew about it:

06:26:21,241 INFO [cache.jgroups.JGroupsKeepAliveHeartbeatReceiver] New cluster view with additional members: Last View: null New View: [192.168.56.3:7800|1] [192.168.56.3:7800, 192.168.56.4:7800]

Once the nodes could see each other it was time to test it out from an end-user perspective. Obviously, in production you’ll have a load-balancer in front of these two nodes. For testing the cluster, though, you want to be able to hit each node specifically. I used two different browsers on the host machine logging in as two different users. There are some short test scenarios on the Alfresco wiki. In addition to those, you might want to:

Create, delete, and update content while a second node is shut down. Start the second node and see if you can navigate to, search for, and read the properties of content as you would expect. Note that it may take a few seconds for the cache and Lucene index to update.
Check out content in one browser and verify that it is checked out on the other.
Simultaneously edit content properties.
Open the edit properties page in one browser and delete the object in another.

That’s it
In a real-world production environment you often have numerous networking issues to deal with that makes this more of a headache, but hopefully this gives you an idea of the basic steps involved, and shows you how to get familiar with it by setting up your own test cluster using virtual machines.

ECM Architect

Tag: Cluster

What’s going on with Alfresco clustering?

Alfresco 3.1 clustering easier with JGroups