Alfresco 3.1 clustering easier with JGroups

Optaros has worked on some of the largest and most complex Alfresco implementations anywhere. Projects where multi-node read-write clusters are required have been particularly challenging. So when Alfresco announced clustering improvements in 3.1 my interest was piqued.

I decided I’d do a simple test: Get a two-node read-write Alfresco 3.1 cluster running using a shared MySQL database and a shared file store (as opposed to a replicated database and a replicated file store). The process is mostly documented here but I thought I’d capture the steps I went through in case someone finds them helpful.

Prepare the virtual machines

If you already have virtual or physical machines ready to go, go on to “Setup the content store & database”.

I already had an Ubuntu server virtual machine image with everything I needed for the test. I upgraded it to Alfresco 3.1, cleared out the repository, and verified that everything was working okay. In order to share my data directory via NFS I did need to use apt-get to install nfs-kernel-server, nfs-common, and portmap, but that’s no big deal.

Once I had the first image all set it was time to create a second. I’m using Sun’s VirtualBox for virtualization. It doesn’t have a “clone” command in the UI and you can’t simply do a file copy of the VDI file. Instead, you have to use VBoxManage on the command line. The form of the command that uses the source VDI file name and target file name didn’t work, but using the source VDI UUID did:

VBoxManage clonevdi 19a7646e-d5cb-4e01-90fd-2bcd556dc1d5 "Ubuntu Test Server Clone.vdi"

It was weird that I had to use the source UUID instead of the file name, but I got what I wanted.
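One way to find the source UUID is to read it from VirtualBox's list of registered disks. The following is a minimal sketch, not a definitive recipe: the helper name vdi_uuid_for and the source file name "Ubuntu Test Server.vdi" are hypothetical, and it assumes that `VBoxManage list hdds` prints a "UUID:" line followed by a "Location:" line for each disk, which may differ between VirtualBox versions.

```shell
# Print the UUID of every registered disk whose Location ends with the
# given file name. Assumes "UUID:" and "Location:" lines in the output of
# `VBoxManage list hdds`; check the layout on your VirtualBox version.
vdi_uuid_for() {
    VBoxManage list hdds | awk -v name="$1" '
        /^UUID:/     { uuid = $2 }
        /^Location:/ {
            sub(/^Location:[ \t]+/, "")
            start = length($0) - length(name) + 1
            if (start >= 1 && substr($0, start) == name) print uuid
        }
    '
}
```

With that in place, the clone command could be written as: VBoxManage clonevdi "$(vdi_uuid_for 'Ubuntu Test Server.vdi')" "Ubuntu Test Server Clone.vdi"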

Setup the network

I used VirtualBox “host only” networking for ease of setup. This allowed my host machine to see the images and the images to see each other.

My server image was originally set up to use DHCP. That appeared to be giving Alfresco and JGroups trouble so I converted the images to use static IP addresses, unique host names, and updated hosts files (I didn’t want to set up DNS). That left me with three machines (one host and two virtual machines called node1 and node2) that could ping each other by name.
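Before moving on, it can help to confirm that every machine's hosts file really contains all the names. Here is a small sketch of such a check; the function name missing_hosts is made up for illustration, and it only inspects a hosts file, not DNS.

```shell
# List the host names (arguments 2..n) that have no entry in the hosts
# file given as argument 1. Prints nothing when every name is present.
missing_hosts() {
    hosts_file=$1
    shift
    for name in "$@"; do
        # Strip comments, then look for the name in any field after the address.
        if ! awk -v n="$name" '
                { sub(/#.*/, "") }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit (found ? 0 : 1) }
            ' "$hosts_file"; then
            echo "$name"
        fi
    done
}
```

Running missing_hosts /etc/hosts node1 node2 on the host and on each virtual machine should print nothing once the files are complete.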

Setup the content store & database

At this point I’ve got two identical Alfresco servers, but each has its own database and data store. For my test, they needed to point to the same database. They also needed to share the content store but have their own local Lucene index.

For this test I decided to use the database and file system on node1 for both nodes. In real life, that wouldn’t be a good setup because losing node1 would bring down the whole cluster. For a shared db/file system setup, you’d want separate nodes, each clustered, for the db and file system.

My Alfresco content store is in “/srv”. I wanted to use NFS to share the content store with the other nodes in my cluster, so I edited /etc/exports to add a new entry for the “/srv” directory. I used an IP address range here but I could have used explicit host names.

/srv 192.168.56.0/25(rw,no_root_squash,async)

You have to restart nfs-kernel-server to make that change take effect:

/etc/init.d/nfs-kernel-server restart

Then, I split out the content and index stores into three directories:

/srv/alfresco-3.1-enterprise
/srv/alfresco-3.1-enterprise-local-index
/srv/alfresco-3.1-enterprise-local-index-backup

And updated custom-repository.properties accordingly:

dir.root=/srv/alfresco-3.1-enterprise
dir.indexes=/srv/alfresco-3.1-enterprise-local-index
dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup
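Because the index directories are meant to stay local while dir.root is shared, a quick sanity check is to confirm that neither index directory sits inside dir.root. The sketch below assumes a simple key=value properties file; the helper names prop and indexes_are_local are hypothetical.

```shell
# Read one key from a properties file (plain key=value lines only).
prop() {
    key_re=$(printf '%s' "$2" | sed 's/[.]/\\./g')
    sed -n "s/^${key_re}=//p" "$1" | tail -n 1
}

# Succeed only if dir.indexes and dir.indexes.backup are outside dir.root.
indexes_are_local() {
    root=$(prop "$1" dir.root)
    [ -n "$root" ] || { echo "dir.root is not set" >&2; return 1; }
    for key in dir.indexes dir.indexes.backup; do
        dir=$(prop "$1" "$key")
        # Compare with trailing slashes so that a sibling such as
        # "<root>-local-index" is not mistaken for a child of <root>.
        case "$dir/" in
            "$root"/*) echo "$key is inside dir.root: $dir" >&2; return 1 ;;
        esac
    done
    return 0
}
```

Called as indexes_are_local custom-repository.properties, it returns success for the layout above, where the index directories are siblings of the shared directory rather than children of it.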

The second node will access the database remotely, so MySQL needed to know about that:

grant all on alfresco31e.* to 'alfresco31e'@'192.168.56.4' identified by 'alfresco31e' with grant option;

Later it seemed that node1 was accessing MySQL via its static IP address rather than localhost as it used to. Rather than figure out why or where that’s config’d, I just ran the same command as the above for node1’s static IP.
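Since the same statement ends up being run once per node address, it can be generated rather than retyped. This is only a sketch: grant_sql is a hypothetical helper, and the database name, user, and password are the ones from the statement above.

```shell
# Print the GRANT statement used in this post for one client address.
# Database, user, and password are all 'alfresco31e', as in the article.
grant_sql() {
    printf "grant all on alfresco31e.* to 'alfresco31e'@'%s' identified by 'alfresco31e' with grant option;\n" "$1"
}
```

For example, assuming a MySQL root account is available on node1: for ip in 192.168.56.3 192.168.56.4; do grant_sql "$ip"; done | mysql -u root -p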

With node1 all set, it was time to give node2 some attention…

My original plan was to NFS mount the node1 data directory as something like “/srv/alfresco-labs-3d-shared” because using the same directory name I would have used on a single node seemed confusing. As it turned out, I think Alfresco must keep track of that data directory name because it complained that my “dir.root” was set incorrectly. So I wound up using the same directory names that I used on node1 and making the same update to custom-repository.properties:

dir.root=/srv/alfresco-3.1-enterprise
dir.indexes=/srv/alfresco-3.1-enterprise-local-index
dir.indexes.backup=/srv/alfresco-3.1-enterprise-local-index-backup

Then I mounted the data directory:

mount 192.168.56.3:/srv/alfresco-3.1-enterprise /srv/alfresco-3.1-enterprise

I didn’t do it, but it would be smart to update /etc/fstab so that the data directory would be automatically mounted on server startup.
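A sketch of what that fstab update might look like follows. The helper names are hypothetical, and the "defaults" mount options are an assumption; a production setup would likely want options tuned for its NFS server.

```shell
# Print an /etc/fstab entry for an NFS mount.
# Fields: device, mount point, type, options, dump, fsck pass.
nfs_fstab_entry() {
    printf '%s:%s %s nfs defaults 0 0\n' "$1" "$2" "$3"
}

# Append a line to a file only if that exact line is not already there.
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}
```

As root on node2, this could be used as: ensure_line /etc/fstab "$(nfs_fstab_entry 192.168.56.3 /srv/alfresco-3.1-enterprise /srv/alfresco-3.1-enterprise)". Running it twice leaves a single entry.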

With that the data directories are all set. Telling node2 to use the database on node1 instead of localhost was a simple custom-repository.properties change:

db.url=jdbc:mysql://node1.alfresco.jpotts.com/alfresco31e

Now node1 and node2 are pointing to the same content store and database, and each has its own Lucene index. The last step was to configure the cluster.

Configure the cluster

Configuring the cluster involved enabling the sample ehcluster-config.xml and making a few small changes to custom-repository.properties.

To enable the ehcluster-config, I copied the ehcluster-config.xml.sample file that came with the sample extensions into my extensions directory and named the copy ehcluster-config.xml. No other changes were needed in this particular case.

In custom-repository.properties, you have to assign a cluster name to activate the cluster. The index recovery mode needs to be set to AUTO so the indexes stay in sync:

alfresco.cluster.name=testcluster
index.recovery.mode=AUTO

Alfresco 3.1 uses JGroups to discover and coordinate cluster members, and the protocol it uses for cluster member communication is configurable. The default is UDP, but I couldn’t get that to work, so I changed it to TCP. I also found that I had to list the hosts in my cluster in order for the two nodes to find each other:

alfresco.jgroups.defaultProtocol=TCP
alfresco.tcp.initial_hosts=node1.alfresco.jpotts.com[7800],node2.alfresco.jpotts.com[7800]
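The host list follows a host[port] pattern separated by commas, so for more than a couple of nodes it may be easier to generate. A minimal sketch, with initial_hosts as a hypothetical helper name and port 7800 taken from the line above:

```shell
# Build the alfresco.tcp.initial_hosts line: host[port],host[port],...
# The first argument is the port; the rest are host names.
initial_hosts() {
    port=$1
    shift
    list=
    for host in "$@"; do
        list="${list:+$list,}${host}[${port}]"
    done
    printf 'alfresco.tcp.initial_hosts=%s\n' "$list"
}
```

Calling initial_hosts 7800 node1.alfresco.jpotts.com node2.alfresco.jpotts.com reproduces the property line shown above.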

As you can see, most of the work was really about networking and data setup. The cluster configuration itself is actually pretty minimal.
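The cluster settings have to be applied on both nodes, so it is worth checking that the two properties files agree on them. The sketch below compares only the four cluster-related keys used in this post; the function names are hypothetical, and how node2's file gets copied over for comparison is left open.

```shell
# Print the cluster-related settings of a properties file in a stable order.
cluster_props() {
    grep -E '^(alfresco\.cluster\.name|alfresco\.jgroups\.defaultProtocol|alfresco\.tcp\.initial_hosts|index\.recovery\.mode)=' "$1" | sort
}

# Succeed if both files carry identical cluster settings;
# otherwise print both sets to stderr and fail.
same_cluster_props() {
    a=$(cluster_props "$1")
    b=$(cluster_props "$2")
    if [ "$a" = "$b" ]; then
        return 0
    fi
    printf 'first file:\n%s\nsecond file:\n%s\n' "$a" "$b" >&2
    return 1
}
```

Settings that legitimately differ between nodes, such as db.url in this setup, are ignored by the comparison.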

Test the cluster

Before starting Tomcat on the two nodes, I enabled a log4j logger so I could see nodes join and leave the cluster:

log4j.logger.org.alfresco.enterprise.repo.cache.jgroups=INFO

After starting up Tomcat, I eventually saw this in catalina.out:

06:24:52,043 INFO [repo.jgroups.AlfrescoJGroupsChannelFactory]
Created JChannelFactory:
Cluster Name: testcluster
Stack Mapping: {DEFAULT=TCP}
Configuration: file:/opt/apache/apache-tomcat-5.5.27/webapps/alfresco/WEB-INF/classes/alfresco/jgroups-default.xml

-------------------------------------------------------
GMS: address is 192.168.56.3:7800
-------------------------------------------------------

When the second node joined the cluster, the first node knew about it:

06:26:21,241 INFO [cache.jgroups.JGroupsKeepAliveHeartbeatReceiver]
New cluster view with additional members:
Last View: null
New View: [192.168.56.3:7800|1] [192.168.56.3:7800, 192.168.56.4:7800]
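Rather than scrolling through catalina.out, the view changes can be pulled out with a couple of small filters. This sketch assumes the "New View:" line format shown above, which may vary between JGroups and Alfresco versions; the function names are hypothetical.

```shell
# Print the view description from each "New View:" line in a log file.
cluster_views() {
    sed -n 's/^.*New View: *//p' "$1"
}

# Count the members (entries containing ':') in the most recent view.
latest_member_count() {
    cluster_views "$1" | tail -n 1 | sed 's/^\[[^]]*\] *//' | tr ',' '\n' | grep -c ':'
}
```

On node1, latest_member_count catalina.out should report 2 once node2 has joined.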

Once the nodes could see each other it was time to test it out from an end-user perspective. Obviously, in production you’ll have a load-balancer in front of these two nodes. For testing the cluster, though, you want to be able to hit each node specifically. I used two different browsers on the host machine logging in as two different users. There are some short test scenarios on the Alfresco wiki. In addition to those, you might want to:

  • Create, delete, and update content while a second node is shut down. Start the second node and see if you can navigate to, search for, and read the properties of content as you would expect. Note that it may take a few seconds for the cache and Lucene index to update.
  • Check out content in one browser and verify that it is checked out on the other.
  • Simultaneously edit content properties.
  • Open the edit properties page in one browser and delete the object in another.

That’s it

In a real-world production environment you often have numerous networking issues to deal with that make this more of a headache, but hopefully this gives you an idea of the basic steps involved and shows you how to get familiar with them by setting up your own test cluster using virtual machines.

21 comments

  1. jpotts says:

    Something I didn’t make obvious in this write-up: The steps under “Configure the cluster” happen on both nodes. You might be able to just copy the custom-repository.properties file from one node to the other to save some typing.

  2. jpotts says:

    Excellent question. I haven’t tried it on Labs. A quick check of Labs 3d shows that the jgroups JAR file is in WEB-INF/lib but the jgroups-default.xml file is missing. The alfresco.cluster.name and alfresco.jgroups variables are also missing from the default custom-repository.properties file.

    I seem to recall that Alfresco had recently made a decision to make the Labs edition more stable essentially at the price of making certain features (like clustering, monitoring, and Oracle-specific tweaks) available in Enterprise only.

    With that said, the Alfresco wiki page on Configuring JGroups says, “JGroups is supported in both the Open and Enterprise code lines, but is only used for cache communication in Enterprise. The core setup of JGroups is therefore common to both code streams.”

    So my guess is that you can get it to work but you’ll have to supply some of the missing configuration yourself (or see if someone in the community has already done it).

    If you try it and get it working, please update us with your findings.

  3. Tuan says:

    Hi Jeff –

    Thanks for sharing the article. The document Administering an Alfresco ECM Production Environment 3_1.pdf lists, on page 107, three properties that need to be set in the custom-repository.properties file:

    dir.contentstore=
    dir.contentstore.deleted=
    dir.auditcontentstore=

    I didn’t see those properties mentioned in your article. From reading the PDF guide I was not sure whether those properties apply only to a replicated content store and not to a shared content store.

    In our case we have a shared content store and a shared database similar to your setup.

    Would you please verify those properties.

    Thanks,
    Tuan

  4. jpotts says:

    The content store, the deleted content store, and the audit content store are defaulted to locations below whatever you set to dir.root. So, unless you need to move these for some reason, there’s no reason to set them explicitly.

    Also, you mentioned a “replicated” content store. In the setup I describe in this post, there is not a replicated content store. There is a single content store that both Alfresco nodes share.

    Jeff

  5. Tuan says:

    Hi Jeff –

    I have posted a message on Alfresco forum and ACT. I am experiencing a strange issue. If I pointed dir.root to a local directory (eg: dir.root=/opt/alfresco/alf_data), Alfresco started fine. However, after I changed dir.root to a mounted NAS storage (eg: dir.root=/alf_data_cluster) as part of configuring Alfresco clustering, Alfresco failed to start. The error msg I got is:

    2009-08-11 17:15:33,949 INFO [STDOUT] 17:15:33,949 User:System INFO [repo.admin.ConfigurationChecker] The Alfresco root data directory (‘dir.root’) is: /alf_data_cluster
    2009-08-11 17:15:33,983 INFO [STDOUT] 17:15:33,983 User:System ERROR [repo.admin.ConfigurationChecker] CONTENT INTEGRITY ERROR: System content not found in content store.
    2009-08-11 17:15:33,983 INFO [STDOUT] 17:15:33,983 User:System ERROR [repo.admin.ConfigurationChecker] Ensure that the ‘dir.root’ property is pointing to the correct data location.
    2009-08-11 17:15:33,988 INFO [STDOUT] 17:15:33,986 User:System ERROR [web.context.ContextLoader] Context initialization failed
    org.alfresco.error.AlfrescoRuntimeException: Ensure that the ‘dir.root’ property is pointing to the correct data location.
    at org.alfresco.repo.admin.ConfigurationChecker.check (ConfigurationChecker.java:312)

    The alfresco UNIX user has RW to the mounted directory.

    Have you seen this kind of error in the past?

    Thanks for your time.
    Tuan

  6. jpotts says:

    Tuan,

    When you changed your props to point to the NAS-mounted directory, did you also move all of the files from /opt/alfresco/alf_data to /alf_data_cluster? Or are you starting with completely clean repository (clean data directory AND clean database)?

    Also, make sure the account running Alfresco has full access to the directory (can create/delete files).

    Jeff

  7. Björn says:

    I’d just like to add a few notes to these instructions, to help anyone who might bump into the same problems I did:

    * If you configure an Alfresco Cluster setup on Linux, be aware that multicast (UDP) does not work properly if you have IPv6. You should pass this flag to the JVM when starting Alfresco:

    -Djava.net.preferIPv4Stack=true

    This is not mentioned in the Alfresco documentation as far as I am aware, but is quite clearly noted on the JGroups website.

    * Also be aware that the alfresco.tcp.initial_hosts is indeed a required property if using JGroups TCP protocol stack – also noted on the JGroups website.

    Thanks for a good tutorial, as usual.

  8. chilonnb says:

    HI,

    I have problem on install Alfresco community Edition -3.2r-Linux-x86 on RHEL 5.2 X64 and All the application is located on /opt/alfresco but the directory alf_data is located on NAS mounted directory (/data/ GFS) and
    i have modified dir.root=/opt/alfresco/alf_data TO dir.root=/data/alfresco/alf_data

    this is the catalina.out:

    Nov 19, 2009 8:27:30 PM org.apache.catalina.core.StandardService start
    INFO: Starting service Catalina
    Nov 19, 2009 8:27:30 PM org.apache.catalina.core.StandardEngine start
    INFO: Starting Servlet Engine: Apache Tomcat/6.0.18
    Nov 19, 2009 8:27:33 PM org.apache.catalina.core.StandardContext addApplicationListener
    INFO: The listener “org.apache.myfaces.webapp.StartupServletContextListener” is already configured for this context. The duplicate definition has been ignored.
    20:27:38,576 INFO [alfresco.config.JndiPropertiesFactoryBean] Loading properties file from class path resource [alfresco/repository.properties]
    20:27:38,594 INFO [alfresco.config.JndiPropertiesFactoryBean] Loading properties file from class path resource [alfresco/domain/transaction.properties]
    20:27:38,594 INFO [alfresco.config.JndiPropertiesFactoryBean] Loading properties file from URL [file:/data/atlassian/alfresco/tomcat/shared/classes/alfresco-global.properties]
    20:27:38,693 INFO [alfresco.config.JndiPropertyPlaceholderConfigurer] Loading properties file from class path resource [alfresco/alfresco-shared.properties]
    20:27:47,269 INFO [alfresco.config.JndiPropertiesFactoryBean] Loading properties file from file [/data/atlassian/alfresco/tomcat/shared/classes/alfresco/extension/subsystems/Authentication/ldap/ldap1/ldap-authentication.properties]
    20:27:58,711 INFO [domain.schema.SchemaBootstrap] Schema managed by database dialect org.hibernate.dialect.MySQLInnoDBDialect.
    20:27:58,876 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37840.sql (Generated).
    20:28:01,856 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37841.sql (Copied from classpath:alfresco/dbscripts/create/2.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-2.2-MappedFKIndexes.sql).
    20:28:01,886 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37842.sql (Copied from classpath:alfresco/dbscripts/create/2.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-2.2-Extra.sql).
    20:28:02,593 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37843.sql (Copied from classpath:alfresco/dbscripts/create/2.2/org.hibernate.dialect.MySQLInnoDBDialect/post-create-indexes-04.sql).
    20:28:02,735 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37844.sql (Copied from classpath:alfresco/dbscripts/create/3.0/org.hibernate.dialect.MySQLInnoDBDialect/create-activities-extras.sql).
    20:28:04,309 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37845.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-LockTables.sql).
    20:28:04,378 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37846.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-ContentTables.sql).
    20:28:04,502 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37847.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-PropertyValueTables.sql).
    20:28:05,274 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37848.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-AuditTables.sql).
    20:28:06,185 INFO [domain.schema.SchemaBootstrap] Executing database script /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-Update-37849.sql (Copied from classpath:alfresco/dbscripts/create/3.2/org.hibernate.dialect.MySQLInnoDBDialect/AlfrescoPostCreate-3.2-AvmTables.sql).
    20:28:06,678 INFO [domain.schema.SchemaBootstrap] All executed statements: /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-All_Statements-37850.sql.
    20:28:07,209 INFO [domain.schema.SchemaBootstrap] Normalized schema dumped to file /data/atlassian/alfresco/tomcat/temp/Alfresco/AlfrescoSchema-MySQLInnoDBDialect-37851.xml.
    #
    #
    # An unexpected error has been detected by Java Runtime Environment:
    #
    #
    # SIGBUS (0x7) at pc=0x00002aab025c4c40, pid=6430, tid=1093753152
    #
    #
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode linux-amd64)
    # Problematic frame:
    # C [libnio.so+0x4c40] Java_java_nio_MappedByteBuffer_load0+0x50
    #
    #
    # An error report file with more information is saved as:
    # /data/atlassian/alfresco/hs_err_pid6430.log

    Thanks for help 🙂

  9. Tommy says:

    Hi…

    I have already set up two Alfresco Enterprise 3.1 instances with trial licenses in two virtual machines and set the jgroups log level to INFO, but I still can’t see the log messages about nodes joining and leaving.

    Am I missing something?

    Thank you
    Tommy

  10. jpotts says:

    @Tommy,

    Double-check that you’ve got the defaultProtocol and initial_hosts set correctly. Also note Björn’s comment in this thread about setting java.net.preferIPv4Stack=true if you are on Linux.

    Jeff

  11. Abid Zafar says:

    Hello,

    Thanks for the valuable information, can any one tell us the cluster configuration, step by step on using Alfresco 3.2.2.7 on windows platform.

    An early response would be highly appreciated.

    Thanx… & Regards
    Abid Zafar

  12. Abid Zafar says:

    Thanks Jeff for your reply, can you please only name/list the files which need to be changed/configured in 3.2.2.7.

    Thanks in advance….
    Regards
    Abid Z

  13. ANGY says:

    Hello, Jeff
    I need your help, I have installed Alfresco enterprise on cluster with 2 nodes:
    node 1: 10.10.0.108
    node 2: 10.10.0.173
    I have a shared folder /nfs in my node 1

    my alfresco-global.properties node 1 is:

    ###############################
    ## Common Alfresco Properties #
    ###############################
    dir.root=/nfs/alf_data1
    dir.indexes=/alf_data1/alfresco-enterprise-local-index
    dir.indexes.backup=/alf_data1/alfresco-enterprise-local-index-backup
    alfresco.cluster.name=clusteralfresco
    alfresco.jgroups.defaultProtocol=TCP
    alfresco.tcp.initial_hosts=10.10.0.108[7800],10.10.0.173[7800]
    db.name=orcl
    db.username=alfresco
    db.password=alfresco
    db.host=10.10.0.109
    db.port=1521
    db.driver=oracle.jdbc.OracleDriver
    #db.url=jdbc:oracle:thin:10.10.0.169:1521:XE
    db.url=jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(HOST=10.10.0.109)(PROTOCOL=TCP)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SID=orcl)))
    hibernate.default_schema=ALFRESCO
    db.pool.validate.query=SELECT 1 FROM DUAL
    index.recovery.mode=AUTO
    alfresco.rmi.services.host=10.10.0.108
    ################################

    my alfresco-global.properties node 2 is:

    ###############################
    ## Common Alfresco Properties #
    ###############################

    dir.root=/nfs/alf_data1
    dir.indexes=/alf_data1/alfresco-enterprise-local-index
    dir.indexes.backup=/alf_data1/alfresco-enterprise-local-index-backup
    alfresco.cluster.name=clusteralfresco
    alfresco.jgroups.defaultProtocol=TCP
    alfresco.tcp.initial_hosts=10.10.0.108[7800],10.10.0.173[7800]
    db.name=orcl
    db.username=alfresco
    db.password=alfresco
    db.host=10.10.0.109
    db.port=1521
    db.driver=oracle.jdbc.OracleDriver
    db.url=jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(HOST=10.10.0.109)(PROTOCOL=TCP)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SID=orcl)))
    hibernate.default_schema=ALFRESCO
    db.pool.validate.query=SELECT 1 FROM DUAL
    index.recovery.mode=AUTO
    alfresco.rmi.services.host=10.10.0.173
    #################################

    When I run my nodes, node 2 runs without any problem, but node 1 fails to start with an error (I attach the error log).
    No matter the order in which I start them, the problem is always on node 1. I don’t know what I am doing wrong. Help me please!!!

  14. jpotts says:

    Angy,

    It sounds like you need some urgent help with your cluster. Unfortunately, I don’t have time to look at this today. As you are on Enterprise you should definitely open a ticket with support so they can get you up and running quickly.

    Jeff

  15. yogesh says:

    Need urgent help to setup alfresco community version cluster.
    While setting up alfresco community version cluster, we are facing problem in EHCACHE replication between servers.
    As Alfresco community version use UDP multicasting for EHCACHE replication. We are running our cluster on Amazon ec2 and Amazon ec2 does not support UDP multicasting. Due to this constraint on amazon ec2, we can not use UDP multicasting of Alfresco community version on our cluster for ehcache replication.

    By referring alfresco documentation we found that, alfresco enterprise version support EHCACHE replication using Jgroup over TCP protocol (http://wiki.alfresco.com/wiki/Configuring_JGroups_and_Alfresco_Clusters). But Alfresco community version does not have Jgroup over TCP support.

    There is one possible approach to solve this problem: we can configure Alfresco Community version to use JGroups over TCP to replicate EHCACHE. But there is no documentation available on the internet about using JGroups with Alfresco Community version.
    Has anyone tried this and set up an Alfresco cluster using JGroups on the Community version?

  16. jpotts says:

    I highly recommend you do not run Community Edition in a cluster like this. It was never meant to work–Community Edition has never supported clustering. In fact, in the forthcoming 4.2a release, all clustering related bits have been removed. So if you do get it to work, you’ll be stuck on that version.

    Jeff
