25 11 2011

In spare moments between real work, I’ve been experimenting with a lightweight content server for user-generated content. In short, that means content in a hierarchical tree that is shallow and very wide. It doesn’t preclude deep, narrow trees, but wide and shallow is what it does best. Here are some of the things I wanted to do.

I wanted to support the same type of RESTful interface as seen in Sakai OAE’s Nakamura and standards like Atom. By that I mean one where the URL points to a resource, and actions expressed by the HTTP methods, HTTP protocol and markers in the URL modify what a RESTful request does; in short, along the lines of the arguments that probably drove the thinking behind Sling, on which Nakamura is based. I mention Atom simply because when you read the standard it talks about the payload of a response, but makes no mention of how the URL should be structured to get that payload. That reinforces the earlier desire.

I wanted the server to start as quickly as possible and use as little memory as possible: ideally < 10s and < 20MB. Java applications have got a bad name for bloat, but there is no reason they have to be huge to serve load. Why so small (in Java terms)? Why not? Contrary to what most apps appear to assume, memory is not there to waste.

I wanted the core server to support some standard protocols, e.g. WebDAV, but I also wanted to make it easy to extend: JAX-RS (RESTEasy) inside OSGi (minimal Sling bootstrap + Apache Felix).

I wanted the request processing to be efficient. Stream all requests (commons-fileupload 1.2.1 with streaming; no writing to intermediate files or byte[]s, all of which involve high GC traffic and slow processing), with everything processed only once and made available via an Adaptable pattern, a concept strong in Sling. And requests are handled by response objects, not servlets. Why? So the response state can be thread-unsafe, and a request can be suspended in memory and unbound from its thread. The resolution binding requests to resources to responses is handled entirely in memory by pointer, avoiding iteration. OK, so the lookup of a resource might go through a cache, but the resolution through to the resource is an in-memory pointer operation.
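A minimal sketch of the parse-once idea behind the Adaptable pattern (class and method names here are illustrative simplifications, not Sling’s actual API): an expensive parse of some request component happens at most once, and every later consumer gets the cached result by pointer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a request that caches each adapted view of itself,
// so any component of the request is processed only once.
public class AdaptableRequest {

    public interface AdapterFactory<T> {
        T create(); // the expensive parse lives here
    }

    private final Map<Class<?>, Object> adapters =
            new ConcurrentHashMap<Class<?>, Object>();
    private final Map<Class<?>, AdapterFactory<?>> factories =
            new ConcurrentHashMap<Class<?>, AdapterFactory<?>>();

    // Factories would be registered once, e.g. as OSGi services.
    public <T> void registerFactory(Class<T> type, AdapterFactory<T> factory) {
        factories.put(type, factory);
    }

    // First call parses and caches; subsequent calls are a map get.
    @SuppressWarnings("unchecked")
    public <T> T adaptTo(Class<T> type) {
        Object adapter = adapters.get(type);
        if (adapter == null) {
            AdapterFactory<?> f = factories.get(type);
            if (f == null) {
                return null; // nothing knows how to adapt to this type
            }
            adapter = f.create();
            adapters.put(type, adapter);
        }
        return (T) adapter;
    }
}
```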

Where content is static, I wanted to keep it static. OSes have file systems that are efficient at storing files, efficient at loading those files from disk and at eliminating disk access completely, so if the bulk of the static files my application needs really are static, why not use the filesystem? Many applications seem to confuse statically deterministic and dynamic. If all the possibilities can be computed at build time, and the resources required to create and serve them are not excessive, then the content is static. What’s excessive? A production build that takes 15 minutes to process all possibilities once a day is better than continually wasting heat and power doing it all the time. I might be a bit more extreme in that view, accepting that filling a TB disk with compiled state is better than continually rebuilding that state incrementally in user-facing production requests. If a deployer wants to do something special (SAN, NAS, something cloud-like) with that filesystem there are plenty of options. All of Httpd/Tomcat/Jetty are capable of serving static files at high thousands of requests per second concurrently, so why not use that ability? Browser-based apps need every bit of speed they can get for static data.

The downside of all of this minimalism is a server that doesn’t have lots of ways of doing the same thing. Unlike Nakamura, you can’t write JSPs or JRuby servlets. It barely uses the OSGi Event system and has none of the sophistication of Jackrabbit. The core container is Apache Felix with the Felix HttpService running a minimalist Jetty. The content system is Sparse Content Map, the search component is Solr as an OSGi bundle. WebDAV is provided by Milton and JAX-RS by RESTEasy. Caching is provided by EhCache. It starts in 8MB in 12s, and after load drops back to about 10MB.

Additional RESTful services are created in one of three ways.

  1. Registering a servlet with the Felix Http Service (whiteboard), which binds to a URL, breaking the desire that nothing should bind to fixed URLs.
  2. Creating a component that provides a marker service, picked up by the OSGi extension to RestEasy that registers that service as a JAX-RS bean.
  3. Creating a factory service that emits JAX-RS annotated classes that act as response objects. The factory is annotated with the type of requests it can deal with, and the response objects tell JAX-RS what they can do with the request. The annotations are discovered when the factory is registered with OSGi, and those annotations are compiled into a one step memory lookup. (single concurrent hashmap get)

Methods 1 and 2 have complete control over the protocol and are wide open to abuse; method 3 follows a processing pattern closely related to Sling’s.
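Method 3 can be sketched roughly like this; the registry, the string key scheme and the factory interface are hypothetical simplifications of what the OSGi extension does, but the essential property is real: resolution at request time is a single concurrent hashmap get, not an iteration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of method 3: factories are registered once, when OSGi
// sees the service and reads its annotations; serving a request is then a
// one-step in-memory lookup.
public class ResponseFactoryRegistry {

    public interface ResponseFactory {
        // Emits a JAX-RS annotated response object for one request.
        Object createResponse();
    }

    private final ConcurrentMap<String, ResponseFactory> byType =
            new ConcurrentHashMap<String, ResponseFactory>();

    // Called at service registration time; the key would be compiled from
    // the annotations on the factory describing what it can handle.
    public void register(String resourceType, ResponseFactory factory) {
        byType.put(resourceType, factory);
    }

    // Called per request: a single ConcurrentHashMap get, no iteration.
    public ResponseFactory resolve(String resourceType) {
        return byType.get(resourceType);
    }
}
```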

Integration testing

Well, unit testing is obvious: we do it and we try to get 100% coverage of every use case that matters. In fact, if you work on a time-and-materials basis for anyone, you should read your contract carefully to work out if you have to fix mistakes at your own expense. If you do, then you will probably start writing more tests to prove to your client that what you did works. It’s no surprise that in other branches of engineering, acceptance testing is part of many contracts. I don’t think an airline would take delivery of a new plane without starting the engines, or a shipping line take delivery of a supertanker without checking it floats. I am bemused that software engineers often get away with saying “it’s done”, when clearly it’s not. Sure, we all make mistakes, but delivering code without test coverage is like handing over a ship that sinks.

Integration testing is less obvious. In Sling there is a set of integration tests that test just about everything against a running server. It’s part of the standard build but lives in its own project. It’s absolutely solid and ensures that nothing builds that is broken, but as an average mortal, I found it scary, since when things did break I had to work hard to find out why. That’s why in Nakamura we wrote all integration tests in scripts: initially bash and Perl, then later Ruby. With hindsight this was a huge mistake. First, you had to configure your machine to run Ruby and all the extensions needed. Not too hard on Linux, but for a time, those on OS X would wait forever for ports to finish building some base library. Dependencies gone mad. Fine if you were one of the few who created the system and pulled everything in over many months, but hell for the newcomer. Mostly, the newcomer walks away, or tweets something that everyone ignores.

The devs also get off the hook. New ones don’t know where to write the tests, or have to learn Ruby (replace Ruby with whatever the script is). Old devs can sweep them under the carpet and, when it gets to release time, ignore the fact that 10% of the tests are still broken… because they didn’t have time to maintain them three Fridays ago at 18:45, just before they went to a party. The party where they zapped 1% of their brain cells, including the ones that were remembering what they should have done at 18:49. Still, they had a good time, the evening raised their morale, started a great weekend ready for the next week and, besides, they had no intention of boarding the ship.

So the integration testing here is done as Java unit tests. If this was a C++ project they would be C++ unit tests. They are in the bundle where the code they test is. They are run by “mvn -Pintegration test”. Even the command says what is going to happen. It starts a full instance of the server (now 12s becomes an age), or uses one that’s already running, and runs the tests. If you’re in Eclipse, they can be run in Eclipse, just as any other test might be, and being OSGi, the new code in the bundle can be redeployed to the running OSGi container. That way the dev creating the bundle can put their tests in their bundle and do integration testing with the same tools they used for unit testing. No excuse. “find . -type d -name integration | grep src/test” finds all integration tests, and by omission, ships that sink.

Sparse Map Content 1.3 Released

21 11 2011

Sparse Map version 1.3 has been tagged (org.sakaiproject.nakamura.core-1.3) and released. Downloads of the source tree in Zip and TarGZ form are available from GitHub.

In this release 8 issues were addressed; the details are in the issue tracker. If you find any issues, please mention them to me or, better still, add an issue to the issue tracker. Unless otherwise stated the license is Apache 2. Thanks to everyone who made this release possible.

Issues Fixed:

To use


The Jar is an OSGi bundle complete with Manifest, bundled dependencies and services, ready for use in Apache Felix.

Clustering Sakai OAE: Part II of ?

18 11 2011

Part II : Solr Clustering

This post carries on from my previous post about clustering Sakai OAE at Charles Sturt University here in Australia. The work done recently has focused on two issues: firstly, running a cluster of app server nodes against a cluster of Solr nodes, and secondly, ensuring that the UI doesn’t flinch if Solr nodes disappear. There is one caveat there: the UI can flinch if there are no Solr servers alive, but the user should not notice if we bring Solr servers down, even if it’s the master Solr server.

Solr in clusters and in Clouds

Anyone wanting to know in detail how standard Solr clusters or clouds work should go and read the Solr Cloud documentation. Solr clusters on a number of levels, but for the scale of indexes that OAE is going to need to support initially we are unlikely to need much more than basic clustering. By basic clustering I mean using a cluster of Solr servers that each have a copy of the index and are able to service queries. The next level of clustering up would be to start sharding the Solr index so that we have sub-clusters of Solr servers operating on shards of the entire index. The configuration of both of these is described in the Solr Cloud documentation using ZooKeeper to manage the cluster. At CSU we are using VMs managed by puppetd, simply because that integrates with the current architecture. One thing I should mention at this stage, as the managers head for the hills with their hands in the air at all this complexity… these are standard Solr servers, out of the box, running in Tomcat with only mild configuration. Easy to do in 15 minutes. We are not talking about deploying OAE proprietary code here.

Query operations

To make a cluster of Solr servers take the load of many app server nodes there needs to be load balancing and automatic failover at the app server nodes. CSU uses hardware load balancers from a well known manufacturer that are perfectly capable of performing application-layer load balancing, but not all deployment sites have that capability, so we are using the LBHttpSolrServer, configured with a large list of potential Solr servers. The ones that are not up and running are not used, which gives us the ability to add more, at will, to take increased app server load. We can, to a degree, scale elastically. We could make this dynamic through ZooKeeper or some other lookup mechanism. There are a number of available LB algorithms with this SolrJ client, and in tests we have done, adding and removing Solr servers from the cluster goes completely unnoticed at the UI.
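Conceptually, the behaviour we rely on looks something like the following stdlib-only sketch: round-robin over a large candidate list, skipping servers currently marked dead. This is an illustration of the idea, not the SolrJ LBHttpSolrServer implementation, and all names here are made up.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of client-side load balancing with failover: pick the
// next live server in round-robin order; dead servers are skipped until a
// background alive-check would mark them live again.
public class RoundRobinBalancer {
    private final List<String> servers;
    private final Set<String> dead = Collections.newSetFromMap(
            new ConcurrentHashMap<String, Boolean>());
    private final AtomicInteger next = new AtomicInteger();

    // The candidate list can be longer than the running cluster: servers that
    // are down are simply never picked, so capacity can be added at will.
    public RoundRobinBalancer(List<String> servers) {
        this.servers = servers;
    }

    public void markDead(String url) { dead.add(url); }
    public void markAlive(String url) { dead.remove(url); }

    public String pick() {
        for (int i = 0; i < servers.size(); i++) {
            // Mask keeps the index non-negative if the counter ever wraps.
            int idx = (next.getAndIncrement() & 0x7fffffff) % servers.size();
            String url = servers.get(idx);
            if (!dead.contains(url)) {
                return url;
            }
        }
        throw new IllegalStateException("no live Solr servers");
    }
}
```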

That was the easy part; now the hard part. OAE uses Solr in a close-to-real-time mode. We are not working on indexing a large corpus of slow-changing or immutable objects. We are indexing user-generated content, and we think that our users expect their content to appear in search results immediately. Anyone who knows how non-trivial inverted indexes operate will know that, up to a point, that is possible in a single JVM, but replicating that behaviour over a cluster is unrealistic. A highly distributed transactional model, where on commit everything everywhere is in the same state (± 1-2ms), becomes hard to deliver while also providing responsiveness and scalability. Certainly near-zero latency between a user’s actions and that update appearing in all indexes is resource intensive and hard to achieve. So we have adopted a just-in-time model. When a user updates something, the data reaches the location where the user is going to look for it just before they look for it, i.e. just in time. The question that now exists is: how much time do we have? I like to call it Quality of Update Service. UI folks want no more than a few ms, so the next Ajax request finds the data, and that puts pressure on the application. A simple application, operating in a single JVM, can use the Solr index to publish all data everywhere in the index instantly. Solr 4 helps us here since it can do soft commits, delivering near-real-time indexing. However, to get any sort of scalability and reliability in the application, we can’t operate in a single memory space. The application needs to publish data that it knows will be needed in a few ms to the locations where it’s needed, achieving the QoUS required. Other locations can be updated more leisurely. That was one of the reasons for moving from a single priority queue in the Solr bundle 1.1 release to multiple priority queues in the Solr bundle 1.2 release.
That improvement doesn’t help with the fundamental problem (bear with me, I will get to the point soon) that is not present in a single-JVM deployment. Writing to a large distributed inverted index, so that all views of that index are consistent, is not instant. Certainly not in Solr 4. So the application, which wants to scale and have no single point of query or failure, must accept that when an update is made, not all dependencies of that data will be updated instantly. OAE is not there yet. It probably demands that OAE looks at what data needs to be where, and when. Data with a high QoUS (i.e. low latency) should be explicitly published within the request cycle. Data with low QoUS should be indexed asynchronously. The request accessing the published data will need to convert from queries through the bitmap of the inverted index into get-by-pointer operations. This will enable the application to break out of the bottleneck it has been presented with (finally, the point). A Solr index only allows a single writer. You can’t update the same document in the same Solr core from multiple JVMs or VMs. AFAIK, merging commits from multiple segments from different Solr servers on independent, disconnected timelines into a single index core is not supported. There is a hard 1:1 relationship between the index writer and the core being updated. So if you lose the process managing the update operation, you lose the ability to update the core.

The Solr bundle used in OAE addresses this by queuing update events in a persistent, disk-based transactional queue. When the Solr master, the only Solr instance managing the Solr core in question, dies, events remain in the queue pending processing. When the master comes back, the queue restarts update operations and the data that was changed enters the index. This is not rocket science, it’s very simple, but it does require that the UI isn’t expecting anything in that queue to be available on the next Ajax request. AFAIK, OAE is not in that place yet.
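The queue idea can be illustrated with a simplified journal-file sketch. The real bundle’s queue is transactional and considerably more sophisticated; everything below (class, method names, event format) is hypothetical, but it shows the essential property: an event is on disk before indexing, so a dead Solr master loses nothing.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Illustrative disk-backed event queue: append update events to a journal
// file, replay them when the consumer (e.g. the Solr master) comes back.
public class JournalQueue {
    private final Path journal;

    public JournalQueue(Path journal) {
        this.journal = journal;
    }

    // The event survives a crash of the indexer because it hits disk first.
    public synchronized void append(String event) {
        try {
            Files.write(journal, (event + "\n").getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Replay pending events, e.g. once the master is reachable again.
    public synchronized List<String> replay() {
        try {
            if (!Files.exists(journal)) {
                return new ArrayList<String>();
            }
            return Files.readAllLines(journal, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Truncate only once every replayed event has been acknowledged.
    public synchronized void clear() {
        try {
            Files.deleteIfExists(journal);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```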

CSU’s OAE/Solr Cluster

So back to the clustering. We have a cluster up and running at CSU. You will remember from Part I that bouncing app server nodes works. A UI user doesn’t know or notice that their state is wandering about the cluster as app server nodes go up and down. All the queries that the app server makes to the Solr server are load balanced, and the LB algorithms inside the LBHttpSolrServer that comes with Solr instantly recognise when a Solr slave dies or becomes available. So the users have no idea what might be happening in the underlying infrastructure. When the master Solr instance, off which all the slaves are feeding, dies, indexing operations pause, queuing up events on each app server node. When the Solr master comes back, indexing restarts. The write operation to the queue is concurrent and thread safe, allowing both synchronous and asynchronous notification events to be persisted in the queue. That write operation is also detached from the indexing operation, so when indexing stops, the application server continues as if nothing had happened. Provided the UI does not place great dependence on data flowing through this route, no user will notice that the Solr master has been taken offline. As any ops person will tell you, it was taken offline… it didn’t just die. Just remember to ask: what took it offline?

Other issues encountered

Prior to release 1.2 and the work at CSU we were using the StreamingUpdateSolrServer. This is an efficient implementation of the SolrJ client that has a memory queue and multiple queue-processor threads. That queue and pool of workers allows the client to hide network latency by parallelizing multiple concurrent update operations, managed by the worker threads. Only when the client commits does the main client thread block while all previous update threads complete and the queue is emptied. There are several issues here. In the current Solr 4 implementation, errors on operations performed by the pool of threads are not communicated back to the main client thread. Hence it has no way of knowing when a remote update server has failed. Only when a commit is performed does the main client thread know there was a problem. This is not too much of a problem when the server goes down, since the commit will be the last operation and hence will roll back the entire update transaction in the client. It also doesn’t matter that the in-memory queue in the SolrJ client is lost, since the on-disk queue never gets committed, and the document IDs are immutably bound to the content they represent, hence indexing twice does no harm. The problem comes when the queue restarts. The client only knows that the commit completed. It doesn’t know how many of the index update operations performed in that transaction were successful, hence with the StreamingUpdateSolrServer we find that on restart, partial update transactions get through to the index. Switching to the CommonsHttpSolrServer, which uses the client thread for all operations, addresses this issue.

One of the other issues that appeared was that with more than one queue sharing the same update SolrJ client, commits on one thread would commit operations on the other thread. The classes are thread safe, but the transactions inside the SolrJ client are not bound to a transaction context, and so the SolrJ clients can’t be shared between transactions. We now bind SolrJ clients to transaction contexts.
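A rough sketch of what binding a client per transaction context looks like. `SolrClient` and `ClientFactory` here are stand-in interfaces, not the real SolrJ types, and the sketch assumes one transaction at a time per thread, so a ThreadLocal is enough to keep one queue’s commit from flushing another queue’s pending updates.

```java
// Illustrative sketch: each transaction context gets its own client, so a
// commit on one queue's client cannot commit another queue's operations.
public class TransactionBoundClients {

    // Stand-in for the SolrJ update client (not the real API).
    public interface SolrClient {
        void add(String doc);
        void commit();
    }

    public interface ClientFactory {
        SolrClient create();
    }

    private final ClientFactory factory;
    // One transaction at a time per thread, so the binding is thread-local.
    private final ThreadLocal<SolrClient> bound = new ThreadLocal<SolrClient>();

    public TransactionBoundClients(ClientFactory factory) {
        this.factory = factory;
    }

    // Lazily bind a fresh client to the current transaction context.
    public SolrClient clientForCurrentTransaction() {
        SolrClient c = bound.get();
        if (c == null) {
            c = factory.create();
            bound.set(c);
        }
        return c;
    }

    // Called when the transaction ends; the next one gets a fresh binding.
    public void release() {
        bound.remove();
    }
}
```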

The work done at CSU will make it into the Solr bundle in the 1.3 release, and any changes that were made to the core code will undoubtedly make it into the managed project code base. A selection of lead developers from the managed project have access to the CSU private repository.



Solr Search Bundle 1.2 released

18 11 2011

The Solr Search v1.2 bundle developed for Nakamura has been released. This is not to be confused with Apache Solr 4. The bundle wraps a snapshot version of Apache Solr 4 at revision 1162474 and exposes a number of OSGi components that allow a SolrJ client to interact with the Solr server.

This release fixes a number of bugs related to concurrency and the indexing operation, identified by moving from a single-threaded indexing operation to a multi-threaded one. These bugs were introduced in the previous release by sharing a StreamingUpdateSolrServer between multiple threads. Although the class is thread safe, and after about June 2011 it does not hang, it does contain an internal memory-based queue that asynchronously sends updates to a remote server. I should state, and you guessed it, that this only impacts situations where the Solr server being updated is not in the same JVM. The problem is that should any of the updates fail, no communication of that fact propagates back to the thread that performed the update operation. In the case of the Solr bundle we attempt to make the indexing queue reliable with a transactional, persistent queue. However, since we don’t know if the update operation failed, we have no chance of working out what to do with the batch of updates being processed. This release fixes those issues. It also fixes a number of clustering and failover issues discovered at Charles Sturt University, which I will leave for a follow-up post.

Other improvements are listed with the issues fixed against this version, link below. As always, thanks go to everyone who contributed and helped to get this release out.

Issues Fixed:
Release Tag:

Downloads are available from the release tag.

To Use from a maven2 project


Clustering Sakai OAE: Part I of ?

8 11 2011

Over the past month, on and off, I have been working with the developers at Charles Sturt University (CSU) to get Sakai OAE to cluster. The reasons they need it are obvious. Without clustering, a single instance of Sakai OAE will not scale up to the number of users they need to support, and with a single instance an SLA on uptime would be pointless. So we have to cluster. Sakai has generally been deployed in a cluster at any institution where more than a handful of students were using the system. Clustering Sakai CLE is relatively straightforward compared to OAE. It can be done by configuring a load balancer with sticky sessions and deploying multiple app servers connecting to a single database. However, when an app server node dies with Sakai CLE, users are logged off and lose state. There are some commercial extensions to Sakai CLE using session replication with Terracotta that may address this.

State Management

In Sakai OAE, right at the start of the project, we wanted to move all state into the client and make the server completely RESTful. No state on the server at all. We also didn’t want to have any session footprint on the server. We wanted server capacity to be a function of request traffic and not a function of the number of logged-in users. With no session state, we wanted to ensure that when an app server node was lost, users’ authentication and authorization state was recreated on whichever new app server node they were allocated. In Sakai OAE, although we are deploying with sticky sessions, we are only doing that to reduce the need for all data to appear instantly everywhere in the cluster. So the custom component that manages the authentication and authorization state of a user has dependencies on the deployment in the cluster.


The second thing we wanted to ensure when designing this area of the code base, almost two years ago, was that we could efficiently track the location of the user within the content system. This may sound a bit odd, but it relates to the context of what they see on screen. If two users are looking at the same information at the same time, then UI components may want to encourage them to collaborate. This service also needs to be cluster aware, since managing the transient location of users in a persistent store would result in unsustainable update traffic. So the cluster tracking service/bundle was designed to track the location of users in memory, and segment that memory out to the app servers in the cluster, so that as we add more app servers we aggregate the memory available to this service rather than replicating it.

Events and Messaging

The final area that needs attention in a clustered environment is messaging: OSGi Events for internal messaging, bridged to JMS provided by ActiveMQ for remote transport. The configuration of messaging, which I still need to look at, is believed to be simpler than the above two services, since it only requires configuring ActiveMQ in multi-master mode. I say simpler based on the GSoC 2010 work on streaming activity events into an Apache Cassandra based warehouse, where we connected with an ActiveMQ broker in multi-master mode. However, I seem to remember saying simple about the last month’s work, so I’ll have to wait and see on that one.


For those familiar with Sakai OAE, you will know we are based on Apache Sling, which itself is based on Apache Felix and Apache Jackrabbit. We use Apache Jackrabbit to store our “enterprise” content, namely the application content and institutional media. Since this is slow changing and read only, we treat each app server node as a standalone instance of Jackrabbit working off separate JCR images on local storage. This avoids the need to cluster Jackrabbit for either HA or scaling, not that it would need either of those used in this mode. We store our user-generated content in a separate repository (Sparse) that is far less sophisticated than Jackrabbit, but covers that use case well enough to satisfy our needs.

CSU Structure

At CSU the layers of the application are isolated from one another using a hardware load balancer (HLB) between each layer. IP requests are received on the front-end host and distributed to a stateless web tier, which does the normal proxy operation through another context on the HLB to the app server nodes. The tiers and LBs are configured to maintain the original request context (host + port), so the app server is configured to use that context to separate user content from application content. This is not the way many have configured OAE, but it eliminates the need to distinguish requests based on the physical port on which they arrive.

To start with, we deployed the OAE 1.0 release, which came out about two months ago. This had not been run on a cluster before release, so it was uncharted territory. Having configured the web tiers and LB contexts, the app servers all worked perfectly; however, none of the services mentioned above worked properly. When an app server died, the LB would switch and cause the user to lose all state. In general they became an anonymous user once more and would have to log in again. Communication in the cluster for both state management and location is managed through EhCache. As with Sakai CLE there are local EhCache caches and key-based invalidation caches. Unlike Sakai CLE, we also use replication within the cluster. Since it would overload the internal network of a cluster if we replicated every user’s state, we don’t do that. Each server maintains a set of private keys which it rotates on a 5-minute cycle. Once a key has been rotated 5 times, it becomes invalid. We use these keys to validate client cookies and identify the user. If the client cookie can’t be validated, the authentication and authorization state expires. Obviously we re-issue the cookies when a new active key becomes available. The HMAC algorithm is SHA-512 and the keys themselves are based on a UUID, which should make it reasonably hard to work out what the secret is before it becomes useless. For each app server in the cluster we need to replicate about 5×20 bytes of data once every 5 minutes, regardless of how many users there are in the system, but if that data is not replicated, sessions can’t migrate between app servers.
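The rotating-key scheme can be sketched as follows. Class and method names are illustrative, and the real implementation’s key material, cookie format and encoding will differ, but the mechanism is the one described above: sign with the newest key, accept any key still in the 5-deep window, and a cookie signed by a key that has rotated out simply stops validating.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.UUID;

// Illustrative sketch of rotating HMAC-SHA-512 keys for cookie validation:
// tokens are signed with the newest key; validation accepts any key still in
// the rotation window, so sessions survive across a rotation.
public class RotatingHmac {
    private static final int WINDOW = 5;
    private final Deque<byte[]> keys = new ArrayDeque<byte[]>();

    public RotatingHmac() {
        rotate(); // start with one active key
    }

    // Called on each rotation cycle; after WINDOW rotations a key falls out
    // of the window and cookies it signed become invalid.
    public final void rotate() {
        keys.addFirst(UUID.randomUUID().toString().getBytes());
        if (keys.size() > WINDOW) {
            keys.removeLast();
        }
    }

    // Sign with the newest key only.
    public String sign(String token) {
        return token + ":" + hmac(keys.peekFirst(), token);
    }

    // A cookie is valid if any key in the window reproduces its signature.
    public boolean validate(String cookie) {
        int i = cookie.lastIndexOf(':');
        if (i < 0) {
            return false;
        }
        String token = cookie.substring(0, i);
        String sig = cookie.substring(i + 1);
        for (byte[] key : keys) {
            if (hmac(key, token).equals(sig)) {
                return true;
            }
        }
        return false;
    }

    private static String hmac(byte[] key, String data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA512");
            mac.init(new SecretKeySpec(key, "HmacSHA512"));
            byte[] raw = mac.doFinal(data.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : raw) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```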

The location service has a slightly higher demand on internal network bandwidth; however, without having done any load testing I can’t quantify that at present. This post is Part I of ?

OAE 1.0 Patches

The first problem, which is why the patch to OAE 1.0 is so big, is that to replicate, the cache key and payload must be serializable (which they were), but the destination of the replicated cache payload (i.e. the classloader deep within EhCache) must also be able to convert the byte stream into an object instance. In OAE 1.0 there were two problems.

  • All the critical classes were private in their OSGi bundles, so the EhCache classloader could not load them.
  • Some of those classes held references to services, so when they were loaded the reference was missing and the code failed with NPEs

Getting classloaders, especially those written before anything like OSGi existed, to recognise private classes is not always straightforward. We are using EhCache 1.5, which uses context classloaders in places. For the moment, the classes are exported as part of the public API, and the patch has re-factored parts of the code base to ensure that references to services are not cached. We could have written a custom classloader to locate the bundle containing the private classes so the cache could load them, however it felt wrong to bypass the OSGi R4 classloader policy, so exporting the cached classes looked like the best way. The downside of this approach is that any bundle in the JVM can now get hold of a replicated cache and load the once-private classes, but then, we’re not running a security manager, so that could be done anyway.

With the classloading issues fixed, the setup of EhCache itself is straightforward and as per the documentation now maintained by Terracotta. We use RMI-based peer discovery and replication. The peer discovery uses multicast over a private group and a separate group port to ensure that we don’t accidentally join another cluster on the same subnet. Replication and invalidation are performed on a cache-by-cache configuration over RMI if required, and we have configured each cache by name to match its expected characteristics.

The patch to Sakai OAE 1.0 has been shared at and

Users now remain logged in when app server nodes die. Once we have the HLB configured correctly they won’t notice, which opens up the ability to elastically scale and perform rolling updates.

We have also configured the Solr component to run in a cluster, but I will save that for another post as its not complete and there are still many problems to be addressed.