Alfresco 3.1 clustering easier with JGroups

Optaros has worked on some of the largest and most complex Alfresco implementations anywhere. Projects where multi-node read-write clusters are required have been particularly challenging. So when Alfresco announced clustering improvements in 3.1 my interest was piqued.

I decided I’d do a simple test: Get a two-node read-write Alfresco 3.1 cluster running using a shared MySQL database and a shared file store (as opposed to a replicated database and a replicated file store). The process is mostly documented here but I thought I’d capture the steps I went through in case someone finds them helpful.

Prepare the virtual machines

If you already have virtual or physical machines ready to go, skip ahead to the section on setting up the content store and database.

I already had an Ubuntu server virtual machine image with everything I needed for the test. I upgraded it to Alfresco 3.1, cleared out the repository, and verified that everything was working okay. In order to share my data directory via NFS I did need to use apt-get to install nfs-kernel-server, nfs-common, and portmap, but that’s no big deal.

Once I had the first image all set it was time to create a second. I’m using Sun’s VirtualBox for virtualization. It doesn’t have a “clone” command in the UI and you can’t simply do a file copy of the VDI file. Instead, you have to use VBoxManage on the command line. The form of the command that uses the source VDI file name and target file name didn’t work, but using the source VDI UUID did:

VBoxManage clonevdi 19a7646e-d5cb-4e01-90fd-2bcd556dc1d5 "Ubuntu Test Server Clone.vdi"

It was weird that I had to use the source UUID instead of the file name, but I got what I wanted.
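If you need to look up the UUID of the source disk, "VBoxManage list hdds" prints the registered disks. Here is a minimal sketch of a helper that pulls the UUID for a given file out of that listing. It assumes each disk in the listing has a "UUID:" line followed by a "Location:" line, which may vary between VirtualBox versions. The function name uuid_for_disk and the source file name in the usage comment are made up for illustration.

```shell
# Hypothetical helper: reads a "VBoxManage list hdds" listing on stdin and
# prints the UUID of the disk whose Location line contains the given name.
# Assumes the listing has "UUID:" and "Location:" lines for each disk.
uuid_for_disk() {
  awk -v disk="$1" '
    /^UUID:/     { uuid = $2 }                              # remember the most recent disk UUID
    /^Location:/ { if (index($0, disk) > 0) print uuid }    # print it when the file name matches
  '
}

# Usage (the source file name here is an assumption):
#   VBoxManage list hdds | uuid_for_disk "Ubuntu Test Server.vdi"
```

The printed UUID can then be passed to the clonevdi command shown above.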

Set up the network

I used VirtualBox “host only” networking for ease of setup. This allowed my host machine to see the images and the images to see each other.

My server image was originally set up to use DHCP. That appeared to be giving Alfresco and JGroups trouble so I converted the images to use static IP addresses, unique host names, and updated hosts files (I didn’t want to set up DNS). That left me with three machines (one host and two virtual machines called node1 and node2) that could ping each other by name.
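For reference, the hosts entries for the two nodes might look roughly like the sketch below. The addresses and fully qualified names are the ones used elsewhere in this post. The short aliases and the write_cluster_hosts helper are assumptions, and the host machine's own entry is left out because its address isn't given here.

```shell
# Hypothetical helper: appends the two cluster node entries to the hosts file
# passed as the first argument.
write_cluster_hosts() {
  cat >> "$1" <<'EOF'
192.168.56.3  node1.alfresco.jpotts.com  node1
192.168.56.4  node2.alfresco.jpotts.com  node2
EOF
}

# Usage, as root on each machine:
#   write_cluster_hosts /etc/hosts
```

Afterwards, "ping node2" from node1 (and the reverse) is a quick way to confirm that the names resolve.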

Set up the content store & database

At this point I’ve got two identical Alfresco servers, but each has its own database and data store. For my test, they needed to point to the same database. They also needed to share the content store but keep their own local Lucene indexes.

For this test I decided to use the database and file system on node1 for both nodes. In real life, that wouldn’t be a good setup because losing node1 would bring down the whole cluster. For a shared db/file system setup, you’d want separate nodes, each clustered, for the db and file system.

My Alfresco content store is in “/srv”. I wanted to use NFS to share the content store with the other nodes in my cluster, so I edited /etc/exports to add a new entry for the “/srv” directory. I used an IP address range here but I could have used explicit host names.

/srv 192.168.56.0/25(rw,no_root_squash,async)

You have to restart the NFS kernel server to make that change take effect:

/etc/init.d/nfs-kernel-server restart

Then, I split out the content and index stores into three directories:

/srv/alfresco-3.1-enterprise
/srv/alfresco-3.1-enterprise-local-index
/srv/alfresco-3.1-enterprise-local-index-backup

And updated custom-repository.properties accordingly:

dir.root=/srv/alfresco-3.1-enterprise
dir.indexes=/srv/alfresco-3.1-enterprise-local-index
dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup

The second node will access the database remotely, so MySQL needed to know about that:

grant all on alfresco31e.* to 'alfresco31e'@'192.168.56.4' identified by 'alfresco31e' with grant option;

Later it seemed that node1 was accessing MySQL via its static IP address rather than localhost as it used to. Rather than figure out why or where that’s configured, I just ran the same command as above for node1’s static IP.
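Since the same grant ends up being run once per node address, it can be generated in a loop. This is only a sketch: grant_sql is a made-up helper name, and the "mysql -u root -p" invocation in the usage comment assumes you administer MySQL as root on node1.

```shell
# Hypothetical helper: prints one grant statement per IP address given,
# using the database name, user, and password from this post.
grant_sql() {
  for ip in "$@"; do
    printf "grant all on alfresco31e.* to 'alfresco31e'@'%s' identified by 'alfresco31e' with grant option;\n" "$ip"
  done
}

# Usage on node1 (node1 is 192.168.56.3, node2 is 192.168.56.4):
#   grant_sql 192.168.56.3 192.168.56.4 | mysql -u root -p
```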

With node1 all set, it was time to give node2 some attention…

My original plan was to NFS mount the node1 data directory as something like “/srv/alfresco-labs-3d-shared” because using the same directory name I would have used on a single node seemed confusing. As it turned out, I think Alfresco must keep track of that data directory name because it complained that my “dir.root” was set incorrectly. So I wound up using the same directory names that I used on node1 and making the same update to custom-repository.properties:

dir.root=/srv/alfresco-3.1-enterprise
dir.indexes=/srv/alfresco-3.1-enterprise-local-index
dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup

Then I mounted the data directory:

mount 192.168.56.3:/srv/alfresco-3.1-enterprise /srv/alfresco-3.1-enterprise

I didn’t do it, but it would be smart to update /etc/fstab so that the data directory would be automatically mounted on server startup.
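A sketch of what that fstab change could look like follows. The add_fstab_entry helper is hypothetical, and the "nfs defaults 0 0" options are an assumption rather than something tested in this setup. The helper only appends the line if it is not already present, so it is safe to run more than once.

```shell
# Hypothetical helper: appends an entry to the given fstab file unless an
# identical line already exists.
add_fstab_entry() {
  fstab="$1"
  entry="$2"
  # -x matches whole lines only, -F treats the entry as a fixed string
  grep -qxF "$entry" "$fstab" 2>/dev/null || printf '%s\n' "$entry" >> "$fstab"
}

# Usage, as root on node2 (mount options are an assumption):
#   add_fstab_entry /etc/fstab \
#     "192.168.56.3:/srv/alfresco-3.1-enterprise /srv/alfresco-3.1-enterprise nfs defaults 0 0"
```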

With that the data directories are all set. Telling node2 to use the database on node1 instead of localhost was a simple custom-repository.properties change:

db.url=jdbc:mysql://node1.alfresco.jpotts.com/alfresco31e

Now node1 and node2 are pointing to the same content store and database, and each has its own Lucene index. The last step was to configure the cluster.

Configure the cluster

Configuring the cluster involved enabling the sample ehcluster-config.xml and making a few small changes to custom-repository.properties.

To enable the ehcluster-config, I copied the ehcluster-config.xml.sample file that came with the sample extensions into my extensions directory and renamed it ehcluster-config.xml. No other changes were needed in this particular case.

In custom-repository.properties, you have to assign a cluster name to activate the cluster. The index recovery mode needs to be set to AUTO so the indexes stay in sync:

alfresco.cluster.name=testcluster
index.recovery.mode=AUTO

Alfresco 3.1 uses JGroups to discover and coordinate cluster members, and the protocol it uses for cluster member communication is configurable. The default is UDP, but I couldn’t get that to work, so I changed it to TCP. I also found that I had to list the hosts in my cluster in order for the two nodes to find each other:

alfresco.jgroups.defaultProtocol=TCP
alfresco.tcp.initial_hosts=node1.alfresco.jpotts.com[7800],node2.alfresco.jpotts.com[7800]
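If the nodes still can't find each other over TCP, it can help to confirm that each host[port] pair in the list is actually reachable. The sketch below splits an initial_hosts value into host and port pairs and probes each one with nc. Both function names are made up, and it assumes a netcat that supports the -z and -w options is installed. The port will only answer once Alfresco is running on that node.

```shell
# Hypothetical helper: turns "a[7800],b[7800]" into one "host port" pair per line.
parse_initial_hosts() {
  printf '%s\n' "$1" | tr ',' '\n' |
    sed -n 's/^\(.*\)\[\([0-9][0-9]*\)\]$/\1 \2/p'
}

# Hypothetical helper: reports whether each host/port pair accepts a TCP connection.
check_initial_hosts() {
  parse_initial_hosts "$1" | while read -r host port; do
    if nc -z -w 3 "$host" "$port" < /dev/null; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
    fi
  done
}

# Usage, from either node:
#   check_initial_hosts 'node1.alfresco.jpotts.com[7800],node2.alfresco.jpotts.com[7800]'
```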

As you can see, most of the work was really about networking and data setup. The cluster configuration itself is actually pretty minimal.

Test the cluster

Before starting Tomcat on the two nodes, I enabled a log4j logger so I could see nodes join and leave the cluster:

log4j.logger.org.alfresco.enterprise.repo.cache.jgroups=INFO

After starting up Tomcat, I eventually saw this in catalina.out:

06:24:52,043 INFO [repo.jgroups.AlfrescoJGroupsChannelFactory]
Created JChannelFactory:
Cluster Name: testcluster
Stack Mapping: {DEFAULT=TCP}
Configuration: file:/opt/apache/apache-tomcat-5.5.27/webapps/alfresco/WEB-INF/classes/alfresco/jgroups-default.xml

-------------------------------------------------------
GMS: address is 192.168.56.3:7800
-------------------------------------------------------

When the second node joined the cluster, the first node knew about it:

06:26:21,241 INFO [cache.jgroups.JGroupsKeepAliveHeartbeatReceiver]
New cluster view with additional members:
Last View: null
New View: [192.168.56.3:7800|1] [192.168.56.3:7800, 192.168.56.4:7800]
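To check the current membership without scrolling through the whole log, the member list can be pulled out of the most recent "New View:" line. This is a rough sketch: cluster_members is a made-up name, it assumes the log lines look like the excerpt above, and the catalina.out path in the usage comment depends on where your Tomcat lives.

```shell
# Hypothetical helper: reads log text on stdin and prints the member list
# from the last "New View:" line it finds.
cluster_members() {
  sed -n 's/^[[:space:]]*New View: \[[^]]*\] \[\(.*\)\]$/\1/p' | tail -n 1
}

# Usage (log path is an assumption):
#   cluster_members < /opt/apache/apache-tomcat-5.5.27/logs/catalina.out
```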

Once the nodes could see each other it was time to test it out from an end-user perspective. Obviously, in production you’ll have a load-balancer in front of these two nodes. For testing the cluster, though, you want to be able to hit each node specifically. I used two different browsers on the host machine logging in as two different users. There are some short test scenarios on the Alfresco wiki. In addition to those, you might want to:

  • Create, delete, and update content while a second node is shut down. Start the second node and see if you can navigate to, search for, and read the properties of content as you would expect. Note that it may take a few seconds for the cache and Lucene index to update.
  • Check out content in one browser and verify that it is checked out on the other.
  • Simultaneously edit content properties.
  • Open the edit properties page in one browser and delete the object in another.

That’s it
In a real-world production environment you often have numerous networking issues to deal with that make this more of a headache, but hopefully this gives you an idea of the basic steps involved and shows you how to get familiar with them by setting up your own test cluster using virtual machines.
