June 8, 2009

Alfresco 3.1 clustering easier with JGroups

Optaros has worked on some of the largest and most complex Alfresco implementations anywhere. Projects where multi-node read-write clusters are required have been particularly challenging. So when Alfresco announced clustering improvements in 3.1 my interest was piqued.

I decided I’d do a simple test: Get a two-node read-write Alfresco 3.1 cluster running using a shared MySQL database and a shared file store (as opposed to a replicated database and a replicated file store). The process is mostly documented here but I thought I’d capture the steps I went through in case someone finds them helpful.

Prepare the virtual machines

If you already have virtual or physical machines ready to go, go on to “Setup the content store & database”.

I already had an Ubuntu server virtual machine image with everything I needed for the test. I upgraded it to Alfresco 3.1, cleared out the repository, and verified that everything was working okay. In order to share my data directory via NFS I did need to use apt-get to install nfs-kernel-server, nfs-common, and portmap, but that’s no big deal.

Once I had the first image all set it was time to create a second. I’m using Sun’s VirtualBox for virtualization. It doesn’t have a “clone” command in the UI and you can’t simply do a file copy of the VDI file. Instead, you have to use VBoxManage on the command line. The form of the command that uses the source VDI file name and target file name didn’t work, but using the source VDI UUID did:

VBoxManage clonevdi 19a7646e-d5cb-4e01-90fd-2bcd556dc1d5 "Ubuntu Test Server Clone.vdi"

It was weird that I had to use the source UUID instead of the file name, but I got what I wanted.

Setup the network

I used VirtualBox “host only” networking for ease of setup. This allowed my host machine to see the images and the images to see each other.

My server image was originally set up to use DHCP. That appeared to be giving Alfresco and JGroups trouble so I converted the images to use static IP addresses, unique host names, and updated hosts files (I didn’t want to set up DNS). That left me with three machines (one host and two virtual machines called node1 and node2) that could ping each other by name.
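For reference, hosts file entries for a setup like this can be generated with a few lines of Python. This is only a sketch: the addresses and host names are the ones used for node1 and node2 elsewhere in this post, the short aliases and the `render_hosts` helper are hypothetical, and the host machine's own entry is left out because its address is not given here.

```python
# Hypothetical helper: build /etc/hosts lines for the two cluster nodes.
# Addresses and fully qualified names are taken from this post; the short
# aliases are an assumption.
NODES = {
    "192.168.56.3": ("node1.alfresco.jpotts.com", "node1"),
    "192.168.56.4": ("node2.alfresco.jpotts.com", "node2"),
}


def render_hosts(nodes):
    """Return one '<ip> <fqdn> <alias>' line per node, ordered by address string."""
    lines = []
    for ip in sorted(nodes):
        fqdn, alias = nodes[ip]
        lines.append(f"{ip}\t{fqdn}\t{alias}")
    return "\n".join(lines)


print(render_hosts(NODES))
```

The same lines would go into the hosts file on every machine that needs to reach the nodes by name.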

Setup the content store & database

At this point I’ve got two identical Alfresco servers, but each has its own database and data store. For my test, they needed to point to the same database. They also needed to share the content store but keep their own local Lucene indexes.

For this test I decided to use the database and file system on node1 for both nodes. In real life, that wouldn’t be a good setup because losing node1 would bring down the whole cluster. For a shared db/file system setup, you’d want separate nodes, each clustered, for the db and file system.

My Alfresco content store is in “/srv”. I wanted to use NFS to share the content store with the other nodes in my cluster, so I edited /etc/exports to add a new entry for the “/srv” directory. I used an IP address range here but I could have used explicit host names.

/srv 192.168.56.0/25(rw,no_root_squash,async)

You have to restart nfs-kernel-server to make that change take effect:

/etc/init.d/nfs-kernel-server restart
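If you are unsure which clients an address range in /etc/exports covers, Python's standard `ipaddress` module can tell you. The sketch below checks the range from the export entry above; the `is_allowed` helper and the third test address are made up for illustration. A /25 prefix covers 192.168.56.0 through 192.168.56.127.

```python
import ipaddress

# The range from the /etc/exports entry above.
EXPORT_RANGE = ipaddress.ip_network("192.168.56.0/25")


def is_allowed(ip, network=EXPORT_RANGE):
    """True if the client address falls inside the exported range."""
    return ipaddress.ip_address(ip) in network


# The first two addresses are the cluster nodes; the last one is an
# arbitrary address outside the /25 range.
for client in ("192.168.56.3", "192.168.56.4", "192.168.56.200"):
    print(client, is_allowed(client))
```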

Then, I split out the content and index stores into three directories:

/srv/alfresco-3.1-enterprise
/srv/alfresco-3.1-enterprise-local-index
/srv/alfresco-3.1-enterprise-local-index-backup

And updated custom-repository.properties accordingly:

dir.root=/srv/alfresco-3.1-enterprise
dir.indexes=/srv/alfresco-3.1-enterprise-local-index
dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup

The second node will access the database remotely, so MySQL needed to know about that:

grant all on alfresco31e.* to 'alfresco31e'@'192.168.56.4' identified by 'alfresco31e' with grant option;

Later it seemed that node1 was accessing MySQL via its static IP address rather than localhost as it used to. Rather than figure out why or where that’s configured, I just ran the same grant command for node1’s static IP.

With node1 all set, it was time to give node2 some attention…

My original plan was to NFS mount the node1 data directory as something like “/srv/alfresco-labs-3d-shared” because using the same directory name I would have used on a single node seemed confusing. As it turned out, I think Alfresco must keep track of that data directory name because it complained that my “dir.root” was set incorrectly. So I wound up using the same directory names that I used on node1 and making the same update to custom-repository.properties:

dir.root=/srv/alfresco-3.1-enterprise
dir.indexes=/srv/alfresco-3.1-enterprise-local-index
dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup

Then I mounted the data directory:

mount 192.168.56.3:/srv/alfresco-3.1-enterprise /srv/alfresco-3.1-enterprise

I didn’t do it, but it would be smart to update /etc/fstab so that the data directory would be automatically mounted on server startup.
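As a rough idea of what that /etc/fstab line could look like, here is a small sketch that formats an NFS entry from the server address and directories used above. The `fstab_entry` helper is hypothetical, and the `defaults` mount options are an assumption; I have not tested this entry on these machines, so adjust the options for your environment.

```python
def fstab_entry(server, export, mount_point, options="defaults"):
    """Format one NFS line for /etc/fstab.

    Field order: device, mount point, filesystem type, options, dump, pass.
    """
    return f"{server}:{export} {mount_point} nfs {options} 0 0"


# Same server address and directory as the mount command above.
data_dir = "/srv/alfresco-3.1-enterprise"
print(fstab_entry("192.168.56.3", data_dir, data_dir))
```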

With that the data directories are all set. Telling node2 to use the database on node1 instead of localhost was a simple custom-repository.properties change:

db.url=jdbc:mysql://node1.alfresco.jpotts.com/alfresco31e

Now node1 and node2 are pointing to the same content store and database, and each has its own Lucene index. The last step was to configure the cluster.
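One easy mistake in this layout is leaving an index directory underneath dir.root. Since dir.root is the NFS-shared directory here, an index below it would end up shared as well. The sketch below is a simple sanity check under that assumption; `parse_properties` and `index_dirs_inside_root` are hypothetical helpers, and the parser only handles plain key=value lines.

```python
from pathlib import PurePosixPath


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def index_dirs_inside_root(props):
    """Return the index-related keys whose directory is dir.root or below it."""
    root = PurePosixPath(props["dir.root"])
    inside = []
    for key in ("dir.indexes", "dir.indexes.backup"):
        path = PurePosixPath(props[key])
        if path == root or root in path.parents:
            inside.append(key)
    return inside


# The values used on both nodes in this post.
SAMPLE = """
dir.root=/srv/alfresco-3.1-enterprise
dir.indexes=/srv/alfresco-3.1-enterprise-local-index
dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup
"""

print(index_dirs_inside_root(parse_properties(SAMPLE)))
```

An empty list means neither index directory lives under the shared root.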

Configure the cluster

Configuring the cluster involved enabling the sample ehcluster-config.xml and making a few small changes to custom-repository.properties.

To enable the ehcluster-config, I copied the ehcluster-config.xml.sample file that ships with the sample extensions into my extensions directory as ehcluster-config.xml. No other changes were needed in this particular case.

In custom-repository.properties, you have to assign a cluster name to activate the cluster. The index recovery mode needs to be set to AUTO so the indexes stay in sync:

alfresco.cluster.name=testcluster
index.recovery.mode=AUTO

Alfresco 3.1 uses JGroups to discover and coordinate cluster members, and the protocol it uses for member communication is configurable. The default is UDP but I couldn’t get that to work, so I changed it to TCP. I also found that I had to list the hosts in my cluster in order for the two nodes to find each other:

alfresco.jgroups.defaultProtocol=TCP
alfresco.tcp.initial_hosts=node1.alfresco.jpotts.com[7800],node2.alfresco.jpotts.com[7800]
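When the nodes cannot find each other, it helps to confirm that every entry in the host list is reachable on its port. The sketch below parses the host[port] format shown above and offers a best-effort TCP check. Both helpers are hypothetical, and the parser only covers the format as it appears in this post.

```python
import re
import socket

# One entry looks like: node1.alfresco.jpotts.com[7800]
HOST_ENTRY = re.compile(r"^\s*([^\[\]\s,]+)\[(\d+)\]\s*$")


def parse_initial_hosts(value):
    """Split 'host[port],host[port]' into a list of (host, port) tuples."""
    hosts = []
    for entry in value.split(","):
        match = HOST_ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unexpected entry: {entry!r}")
        hosts.append((match.group(1), int(match.group(2))))
    return hosts


def can_connect(host, port, timeout=2.0):
    """Best-effort TCP check; returns False on any connection error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


value = "node1.alfresco.jpotts.com[7800],node2.alfresco.jpotts.com[7800]"
print(parse_initial_hosts(value))
```

Calling `can_connect(host, port)` for each pair from the other node, once Alfresco is running, shows whether the port is reachable at all.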

As you can see, most of the work was really about networking and data setup. The cluster configuration itself is actually pretty minimal.

Test the cluster

Before starting Tomcat on the two nodes, I enabled a log4j logger so I could see nodes join and leave the cluster:

log4j.logger.org.alfresco.enterprise.repo.cache.jgroups=INFO

After starting up Tomcat, I eventually saw this in catalina.out:

06:24:52,043 INFO [repo.jgroups.AlfrescoJGroupsChannelFactory] Created JChannelFactory:
Cluster Name: testcluster
Stack Mapping: {DEFAULT=TCP}
Configuration: file:/opt/apache/apache-tomcat-5.5.27/webapps/alfresco/WEB-INF/classes/alfresco/jgroups-default.xml

-------------------------------------------------------
GMS: address is 192.168.56.3:7800
-------------------------------------------------------

When the second node joined the cluster, the first node knew about it:

06:26:21,241 INFO [cache.jgroups.JGroupsKeepAliveHeartbeatReceiver] New cluster view with additional members: Last View: null New View: [192.168.56.3:7800|1] [192.168.56.3:7800, 192.168.56.4:7800]
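If you want to watch cluster membership from a script rather than by eye, the member list can be pulled out of that log line with a regular expression. This is a sketch based only on the single sample line above, so the pattern is an assumption about the format and not a documented contract. The first bracket appears to hold the address that created the view and a sequence number; the second holds the members.

```python
import re

# Pattern derived from the sample log line in this post (an assumption).
VIEW = re.compile(
    r"New View:\s*\[(?P<creator>[^|\]]+)\|(?P<view_id>\d+)\]\s*\[(?P<members>[^\]]*)\]"
)


def parse_view(line):
    """Return (creator, view_id, members) from a 'New View' log line, or None."""
    match = VIEW.search(line)
    if match is None:
        return None
    members = [m.strip() for m in match.group("members").split(",") if m.strip()]
    return match.group("creator"), int(match.group("view_id")), members


sample = (
    "06:26:21,241 INFO [cache.jgroups.JGroupsKeepAliveHeartbeatReceiver] "
    "New cluster view with additional members: Last View: null "
    "New View: [192.168.56.3:7800|1] [192.168.56.3:7800, 192.168.56.4:7800]"
)
print(parse_view(sample))
```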

Once the nodes could see each other it was time to test it out from an end-user perspective. Obviously, in production you’ll have a load-balancer in front of these two nodes. For testing the cluster, though, you want to be able to hit each node specifically. I used two different browsers on the host machine logging in as two different users. There are some short test scenarios on the Alfresco wiki. In addition to those, you might want to:

Create, delete, and update content while a second node is shut down. Start the second node and see if you can navigate to, search for, and read the properties of content as you would expect. Note that it may take a few seconds for the cache and Lucene index to update.
Check out content in one browser and verify that it is checked out on the other.
Simultaneously edit content properties.
Open the edit properties page in one browser and delete the object in another.

That’s it
In a real-world production environment you often have numerous networking issues to deal with that make this more of a headache, but hopefully this gives you an idea of the basic steps involved and shows you how to get familiar with clustering by setting up your own test cluster using virtual machines.

21 comments

June 8, 2009 at 2:25 pm

jpotts says:

Something I didn’t make obvious in this write-up: The steps under “Configure the cluster” happen on both nodes. You might be able to just copy the custom-repository.properties file from one node to the other to save some typing.
June 9, 2009 at 4:09 am

Pawel says:

Is JGroups available only in the Enterprise “commercial” version of Alfresco?
June 9, 2009 at 8:13 am

jpotts says:

Excellent question. I haven’t tried it on Labs. A quick check of Labs 3d shows that the jgroups JAR file is in WEB-INF/lib but the jgroups-default.xml file is missing. The alfresco.cluster.name and alfresco.jgroups variables are also missing from the default custom-repository.properties file.

I seem to recall that Alfresco had recently made a decision to make the Labs edition more stable essentially at the price of making certain features (like clustering, monitoring, and Oracle-specific tweaks) available in Enterprise only.

With that said, the Alfresco wiki page on Configuring JGroups says, “JGroups is supported in both the Open and Enterprise code lines, but is only used for cache communication in Enterprise. The core setup of JGroups is therefore common to both code streams.”

So my guess is that you can get it to work but you’ll have to supply some of the missing configuration yourself (or see if someone in the community has already done it).

If you try it and get it working, please update us with your findings.
August 10, 2009 at 8:21 pm

Tuan says:

Hi Jeff –

Thanks for sharing the article. The document Administering an Alfresco ECM Production Environment 3_1.pdf lists, on page 107, three properties that need to be set in the custom-repository.properties file:

dir.contentstore=
dir.contentstore.deleted=
dir.auditcontentstore=

I didn’t see those properties mentioned in your article. From reading the PDF guide I was not sure whether those properties apply only to a replicated content store and not to a shared content store.

In our case we have a shared content store and a shared database similar to your setup.

Would you please verify those properties.

Thanks,
Tuan
August 11, 2009 at 9:11 am

jpotts says:

The content store, the deleted content store, and the audit content store are defaulted to locations below whatever you set to dir.root. So, unless you need to move these for some reason, there’s no reason to set them explicitly.

Also, you mentioned a “replicated” content store. In the setup I describe in this post, there is not a replicated content store. There is a single content store that both Alfresco nodes share.

Jeff
August 11, 2009 at 11:42 am

Tuan says:

Thanks, Jeff.
August 11, 2009 at 8:40 pm

Tuan says:

Hi Jeff –

I have posted a message on the Alfresco forum and ACT. I am experiencing a strange issue. If I point dir.root to a local directory (e.g. dir.root=/opt/alfresco/alf_data), Alfresco starts fine. However, after I changed dir.root to a mounted NAS storage directory (e.g. dir.root=/alf_data_cluster) as part of configuring Alfresco clustering, Alfresco failed to start. The error message I got is:

2009-08-11 17:15:33,949 INFO [STDOUT] 17:15:33,949 User:System INFO [repo.admin.ConfigurationChecker] The Alfresco root data directory (‘dir.root’) is: /alf_data_cluster
2009-08-11 17:15:33,983 INFO [STDOUT] 17:15:33,983 User:System ERROR [repo.admin.ConfigurationChecker] CONTENT INTEGRITY ERROR: System content not found in content store.
2009-08-11 17:15:33,983 INFO [STDOUT] 17:15:33,983 User:System ERROR [repo.admin.ConfigurationChecker] Ensure that the ‘dir.root’ property is pointing to the correct data location.
2009-08-11 17:15:33,988 INFO [STDOUT] 17:15:33,986 User:System ERROR [web.context.ContextLoader] Context initialization failed
org.alfresco.error.AlfrescoRuntimeException: Ensure that the ‘dir.root’ property is pointing to the correct data location.
at org.alfresco.repo.admin.ConfigurationChecker.check (ConfigurationChecker.java:312)

The alfresco UNIX user has RW to the mounted directory.

Have you seen this kind of error in the past?

Thanks for your time.
Tuan
August 13, 2009 at 8:31 am

jpotts says:

Tuan,

When you changed your props to point to the NAS-mounted directory, did you also move all of the files from /opt/alfresco/alf_data to /alf_data_cluster? Or are you starting with a completely clean repository (clean data directory AND clean database)?

Also, make sure the account running Alfresco has full access to the directory (can create/delete files).

Jeff
August 26, 2009 at 1:29 am

Björn says:

I’d just like to add a few notes to these instructions, to help anyone who might bump into the same problems I did:

* If you configure an Alfresco Cluster setup on Linux, be aware that multicast (UDP) does not work properly if you have IPv6. You should pass this flag to the JVM when starting Alfresco:

-Djava.net.preferIPv4Stack=true

This is not mentioned in the Alfresco documentation as far as I am aware, but is quite clearly noted on the JGroups website.

* Also be aware that the alfresco.tcp.initial_hosts is indeed a required property if using JGroups TCP protocol stack – also noted on the JGroups website.

Thanks for a good tutorial, as usual.
November 23, 2009 at 12:07 pm

chilonnb says:

Hi,

I have a problem installing Alfresco Community Edition 3.2r (Linux x86) on RHEL 5.2 x64. The application is located in /opt/alfresco, but the alf_data directory is on a NAS-mounted directory (/data/ GFS), and
I have changed dir.root=/opt/alfresco/alf_data to dir.root=/data/alfresco/alf_data

this is the catalina.out:

Nov 19, 2009 8:27:30 PM org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
Nov 19, 2009 8:27:30 PM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.18
Nov 19, 2009 8:27:33 PM org.apache.catalina.core.StandardContext addApplicationListener
INFO: The listener “org.apache.myfaces.webapp.StartupServletContextListener” is already configured for this context. The duplicate definition has been ignored.
20:27:38,576 INFO [alfresco.config.JndiPropertiesFactoryBean] Loading properties file from class path resource [alfresco/repository.properties]
20:27:38,594 INFO [alfresco.config.JndiPropertiesFactoryBean] Loading properties file from class path resource [alfresco/domain/transaction.properties]
20:27:38,594 INFO [alfresco.config.JndiPropertiesFactoryBean] Loading properties file from URL [file:/data/atlassian/alfresco/tomcat/shared/classes/alfresco-global.properties]
20:27:38,693 INFO [alfresco.config.JndiPropertyPlaceholderConfigurer] Loading properties file from class path resource [alfresco/alfresco-shared.properties]
20:27:47,269 INFO [alfresco.config.JndiPropertiesFactoryBean] Loading properties file from file [/data/atlassian/alfresco/tomcat/shared/classes/alfresco/extension/subsystems/Authentication/ldap/ldap1/ldap-authentication.properties]
20:27:58,711 INFO [domain.schema.SchemaBootstrap] Schema managed by database dialect org.hibernate.dialect.MySQLInnoDBDialect.
20:27:58,876 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37840.sql (Generated).
20:28:01,856 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37841.sql (Copied from classpath:alfresco/dbscripts/create/2.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-2.2-MappedFKIndexes.sql).
20:28:01,886 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37842.sql (Copied from classpath:alfresco/dbscripts/create/2.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-2.2-Extra.sql).
20:28:02,593 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37843.sql (Copied from classpath:alfresco/dbscripts/create/2.2/org.hibernate.dialect.MySQLInnoDBDialect/post-create-indexes-04.sql).
20:28:02,735 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37844.sql (Copied from classpath:alfresco/dbscripts/create/3.0/org.hibernate.dialect.MySQLInnoDBDialect/create-activities-extras.sql).
20:28:04,309 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37845.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-LockTables.sql).
20:28:04,378 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37846.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-ContentTables.sql).
20:28:04,502 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37847.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-PropertyValueTables.sql).
20:28:05,274 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37848.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-AuditTables.sql).
20:28:06,185 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37849.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-AvmTables.sql).
20:28:06,678 INFO [domain.schema.SchemaBootstrap] All executed statements: /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-All_Statements-37850.sql.
20:28:07,209 INFO [domain.schema.SchemaBootstrap] Normalized schema dumped to file /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-37851.xml.
#
#
# An unexpected error has been detected by Java Runtime Environment:
#
#
# SIGBUS (0x7) at pc=0x00002aab025c4c40, pid=6430, tid=1093753152
#
#
# Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode linux-amd64)
# Problematic frame:
# C [libnio.so+0x4c40] Java_java_nio_MappedByteBuffer_load0+0x50
#
#
# An error report file with more information is saved as:
# /data/atlassian/alfresco/hs_err_pid6430.log

Thanks for help 🙂
November 23, 2009 at 12:22 pm

jpotts says:

No idea. You might have better luck on the Alfresco Forums (http://forums.alfresco.com).

Sorry!

Jeff
December 17, 2009 at 8:39 am

Tommy says:

Hi…

I have already set up two Alfresco Enterprise 3.1 instances with trial licenses in two virtual machines and set the log level of jgroups to INFO, but I still can’t see the log messages about nodes joining and leaving.

Am I missing something?

Thank you
Tommy
December 18, 2009 at 11:02 am

jpotts says:

@Tommy,

Double-check that you’ve got the defaultProtocol and initial_hosts set correctly. Also note Björn’s comment in this thread about setting java.net.preferIPv4Stack=true if you are on Linux.

Jeff
December 2, 2011 at 1:20 am

Abid Zafar says:

Hello,

Thanks for the valuable information. Can anyone tell us the cluster configuration, step by step, for Alfresco 3.2.2.7 on the Windows platform?

An early response would be highly appreciated.

Thanx… & Regards
Abid Zafar
December 2, 2011 at 3:30 pm

jpotts says:

I know you are on 3.2, but you might check http://docs.alfresco.com. There are multiple clustering topics covered in the official documentation that might be applicable to your install.

Jeff
December 6, 2011 at 11:14 pm

Abid Zafar says:

Thanks Jeff for your reply, can you please only name/list the files which need to be changed/configured in 3.2.2.7.

Thanks in advance….
Regards
Abid Z
February 6, 2012 at 12:55 pm

ANGY says:

Hello, Jeff
I need your help, I have installed Alfresco enterprise on cluster with 2 nodes:
node 1: 10.10.0.108
node 2: 10.10.0.173
I have a shared folder /nfs in my node 1

my alfresco-global.properties node 1 is:

###############################
## Common Alfresco Properties #
###############################
dir.root=/nfs/alf_data1
dir.indexes=/alf_data1/alfresco-enterprise-local-index
dir.indexes.backup=/alf_data1/alfresco-enterprise-local-index-backup
alfresco.cluster.name=clusteralfresco
alfresco.jgroups.defaultProtocol=TCP
alfresco.tcp.initial_hosts=10.10.0.108[7800],10.10.0.173[7800]
db.name=orcl
db.username=alfresco
db.password=alfresco
db.host=10.10.0.109
db.port=1521
db.driver=oracle.jdbc.OracleDriver
#db.url=jdbc:oracle:thin:10.10.0.169:1521:XE
db.url=jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(HOST=10.10.0.109)(PROTOCOL=TCP)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SID=orcl)))
hibernate.default_schema=ALFRESCO
db.pool.validate.query=SELECT 1 FROM DUAL
index.recovery.mode=AUTO
alfresco.rmi.services.host=10.10.0.108
################################

my alfresco-global.properties node 2 is:

###############################
## Common Alfresco Properties #
###############################

dir.root=/nfs/alf_data1
dir.indexes=/alf_data1/alfresco-enterprise-local-index
dir.indexes.backup=/alf_data1/alfresco-enterprise-local-index-backup
alfresco.cluster.name=clusteralfresco
alfresco.jgroups.defaultProtocol=TCP
alfresco.tcp.initial_hosts=10.10.0.108[7800],10.10.0.173[7800]
db.name=orcl
db.username=alfresco
db.password=alfresco
db.host=10.10.0.109
db.port=1521
db.driver=oracle.jdbc.OracleDriver
db.url=jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(HOST=10.10.0.109)(PROTOCOL=TCP)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SID=orcl)))
hibernate.default_schema=ALFRESCO
db.pool.validate.query=SELECT 1 FROM DUAL
index.recovery.mode=AUTO
alfresco.rmi.services.host=10.10.0.173
#################################

When I start my nodes, node 2 runs without any problem, but node 1 fails with an error (I attach the error log).
No matter what order I start them in, the problem is always on node 1. I don’t know what I am doing wrong. Please help!
February 10, 2012 at 10:54 am

jpotts says:

Angy,

It sounds like you need some urgent help with your cluster. Unfortunately, I don’t have time to look at this today. As you are on Enterprise you should definitely open a ticket with support so they can get you up and running quickly.

Jeff
October 9, 2012 at 1:53 am

yogesh says:

I need urgent help setting up an Alfresco Community cluster.
While setting up the cluster, we are facing a problem with EHCACHE replication between servers.
Alfresco Community uses UDP multicasting for EHCACHE replication. We are running our cluster on Amazon EC2, which does not support UDP multicasting, so we cannot use the Community version’s UDP multicasting for EHCACHE replication on our cluster.

According to the Alfresco documentation, the Enterprise version supports EHCACHE replication using JGroups over TCP (http://wiki.alfresco.com/wiki/Configuring_JGroups_and_Alfresco_Clusters), but the Community version does not have JGroups over TCP support.

One possible approach is to configure Alfresco Community to use JGroups over TCP to replicate EHCACHE, but there is no documentation available on the internet about using JGroups with the Community version.
Has anyone tried this and set up an Alfresco cluster using JGroups on the Community version?
October 9, 2012 at 7:52 am

jpotts says:

I highly recommend you do not run Community Edition in a cluster like this. It was never meant to work–Community Edition has never supported clustering. In fact, in the forthcoming 4.2a release, all clustering related bits have been removed. So if you do get it to work, you’ll be stuck on that version.

Jeff
