Making the Digital Repository at Cambridge Fast(er)

18 12 2012

For the past month or so I have been working on upgrading the Digital Repository at the University of Cambridge Library from a heavily customised version of DSpace 1.6 to a minimally customised version of DSpace 3. The local customizations were deemed necessary

List of commandments from Lev. 9–12; numbered Halakhot 8–18; includes a list of where they appear in Maimonides’ Book of Commandments and the Mishneh Torah, and possibly references to another work.

to achieve the performance required to host the 217,000 items and the 4M metadata records in the Digital Repository. DSpace 3 which was releases at the end of November 2012 showed promise in removing the need for many of the local patches. I am happy to report that this has proved to be the case and we are now able to cover all of our local use cases using a stock DSpace release with local customizations and optimizations isolated into an overlay. One problem however remains, performance.

The current Digital Repository contains detailed metadata and is focused on preservation of artifacts. Unlike the more popular Digital Library which has generated significant media interest in recent weeks with items like “A 2,000-year-old copy of the 10 Commandments” , the Digital Repository does not yet have significant traffic. That may change in the next few months as the UK government is taking a lead in the Open Access agenda which may prompt the rest of the world to follow. Cambridge, with its leading position in global research will be depositing its output into its Digital Repository. Hence, a primary concern of the upgrade process was to ensure that the Digital Repository could handle the expected increase in traffic driven by Open Access.

Some basics

DSpace is a Java web application running in Tomcat. Testing Tomcat for a trivial application reveals that it will deliver content at a peak rate of anything up to 6K pages per second. If that rate were sustained for 24h, 518M pages would have been served. Unfortunately traffic is never evenly distributed and applications always add overhead but this gives an idea of the basics. At 1K pages/s 86M pages would be served in 24h. Many real Java webapps are capable of jogging along happily at that rate. Unfortunately DSpace is not. It’s an old code base that has focused on the preservation use case. Many page requests perform heavy database access and the flexible Cocoon based XMLUI  is resource intensive. The modified 1.6 instance using a JSP UI delivers pages at 8/s on a moderate 8 core box and the unmodified DSpace 3, using the XMLUI instance a 15/s on a moderate 4 core box. Surprisingly, because the application does not have any web 2.0 functionality to speak of, even at that low level it feels quite nippy as each page is a single request once the page assets (css/js/png etc) are distributed and cached. With the Cambridge Digital Library regularly serving 1M pages per hour, Open Access on the Digital Repository at Cambridge will change that. Overloaded DSpace remains solid and reliable, but slow.

Apache Httpd mod_cache to the rescue

Fortunately this application is a publishing platform. For anonymous users that data changes very slowly and the number of users that log into the application is low. The DSpace URLs are well structured and predictable with no major abuse of the HTTP standard. Event the search operations backed by Solr are well structured. The current data set of 217K items published as html pages represents about 3.9GB of uncompressed data, less if the responses are stored and delivered gzipped. Consequently configuring Apache HTTPD with mod_cache to cache page responses for anonymous users has a dramatic impact on throughput. A trivial test with Apache Benchmark over 100 concurrent connections indicates a peak throughput of around 19K pages per second. I will leave you to do the rest of the maths. I think network will be the limiting factor.

Loosing statistics

There are some disadvantages to this approach. Deep within DSpace statistics are recorded. Since the cache will serve most of the content for anon users these statistics no longer make sense. I have misgivings about the way in which the statistics are being collected since if the request is serviced by Cocoon, the access is recorded in a Solr Core by performing an update operation on the core. This is one of the reasons why the throughput is slow, but I also have my doubts that this is a good way of recording statistics. Lucene indexes are often bounded by the cardinality of the index. I worry that over time the Lucene indexes behind the Solr instance recording statistics will overflow available memory. I would have thought, but have no evidence, that recording stats in a Big Data way would be more scalable, and in some ways just as easy for small institutions (ie append only log files, periodically processed with Map Reduce if required). Alternatively, Google Analytics.


Before you rush off and mod_cache all your slow applications there is one fly in the ointment. To get this to work you have to separate anonymous responses from authenticated responses. You also have to perform that separation based on the request and nothing else, and you have to ensure that your cache never gets polluted, otherwise anonymous users, including a Google spider, will see authenticated responses. There is precious little in an http request that a server can influence. It can set cookies, and change the url. Applications could segment URL space based on the role of the user, but that is ugly from a URI point of view. Suddenly there are 2 URIs pointing to the same resource. Setting a cookies doesn’t work, since the response that would have set the cookie is cached, hopefully without the cookie. The solution that worked for us was segment authenticated requests onto https and leave anon requests on http. Then configure the URL space used to perform authentication such that it would not be cached, and ensure an anon users never accessed https content, and an authenticated user, never accesses http content. The latter restriction ensures no authenticated content ever gets cached and the former ensures that the expected tsunami of anon users doesn’t impact the management of the repository. Much though I would have liked to serve everything over a single protocol on one virtual host the approach is generally applicable to all webapps.

I think the key message is, if you can host using Apache Httpd with mod_mem_cache or even the disk version, then there is no need to jump through hoops to use exotic applications stacks. My testing of Dspace 3 was done with Apache HTTPD 2.2 and all the other components running on a single 4 core box probably well past its sell by date.

Sakai CLE ElasticSearch

11 10 2012

A long time ago, I wrote a search module for Sakai 2 as CLE was known then. It attempted to make every node in a CLE instance share the load of indexing and searching and make the search aspect of a CLE cluster scale elastically. To some extents it worked, but it had problems. The indexing queue was persisted in a DB table and it was based on a old version of Lucene that didn’t have anything as useful as commits. Consequently it could get its segments into a bit of mess at times. The world has moved on in the 5 years since I wrote that code, and two viable alternatives for supporting Search in Sakai CLE have emerged. Apache Solr and Elastic Search. Both can be run as remote servers or embedded. Both are solid reliable releases. It could be argued that Solr has more support for sophisticated index schema, and it’s probably true that Elastic Search is easier to deploy for elastic scaling and real time indexing as that’s its default behaviour.

For those wanting to try Sakai CLE with Apache Solr as the search server then look no further than the work that Adam Marshall has been doing at Oxford University. That allows you to spin up a Solr instance and connect your Sakai CLE instances to it. You will have to do some reasonably sophisticated master slave configuration to make it resilient to failures and don’t expect the indexing operations to be real-time. There are plenty of references to the work required to do that in this blog, and arguments why I currently prefer ElasticSearch over Solr.

Deployment and reliability

ElasticSearch comes out the box being real-time, elastic and cloud aware, with built-in AWS EC2 knowledge as well as rack awareness. Its been built to shard, partition and replicate indexes out of the box. The ElasticSearch client as I am finding out is simple to embed into most environments including OSGi and when embedded makes each app server node a part of elastic search cluster. Best of all, for the nervous by nature, is the resilience that comes from spinning up more than 3 instances in the same cluster. In fact, I have been finding it hard to damage elastic search indexes in tests. It’s perfectly possible to do all of this with Solr, but the deployer has to work a little harder adding some custom components to support a writeahead log and a Zookeeper instance to manage the cloud.

Metadata Indexing

Probably the best part of ElasticSearch is the client which is a fully multithreaded client following the same pattern Communicating Sequential Processes first described by Tony Hoare and one of the motivators for the Go language. This allows a client for submit suitably light weight indexing requests to the ElasticSearch cluster via an embedded client without needing to think about managing a queue or the latency of indexing. This nice little feature turns the 1000 lines of code I had to write for Sakai CLE  and OAE search into about 20. Initial tests show that indexing can be done within the request loop and because of the true real-time nature ElasticSearch with its write ahead log, results are available about 50ms after the transaction commits. To maintain that latency, I only index metadata via this route. Document indexing takes a different route.

Document Indexing

I found with the original Sakai 2 search and subsequent Solr based indexing of documents in Sakai OAE that indexing bodies was expensive. In some instances tokenizing office documents could place extreme strain on a JVM heap. For that reason when I did the indexing service in the Django version of OAE I did two things. I offloaded the document body indexing operations to separate processors driven by a queue of events, following the CSP pattern mentioned above, and I made the content store single instance. Where users collaborate, they often upload the same document. With a single instance content store, only a single instance of a document is stored and hence, tokenizing and information extraction is only performed once. This greatly reduces the cost of indexing. The store isn’t collision perfect but by performing a hash on the document body as its saved its possible to eliminate most if not all collisions. Certainly SHA1(ing) enough of the body eliminates all collisions.

So the document indexing processes use the index to locate documents that need to be indexed and then use the single instance content store to eliminate duplicate tokenizing. Using this approach in the Sparse Content Map content system which is already single instance has a dramatic impact on IO. Sakai CLE Content Hosting Service is not single instance at present but could be adjusted to be so once hashes are known. It would be nice to fix that aspect of CHS at some point.

Current state

I am still working on this code, and this post is part notes, part notification should I get distracted. My testbed is the Sparse Content Map content system only because it builds in 20s, starts in 5, has full integration test coverage and compliant webdav support thanks to Milton. There is currently nothing in the code base that prevents it using Spring or a Webapp container as opposed to OSGi, and the coupling is loose being event driven. The best part is the result should scale as far as ES can scale which is probably a lot larger than any CLE instance in production.

Deprecate Solr Bundle

2 02 2012

Before that scares the hell out of anyone using Solr, the Solr bundle I am talking about is a small shim OSGi bundle that takes content from a Social Content Repository system called Sparse Map and indexes the content using Solr, either embedded  or as a remote Solr cluster. The Solr used is a snapshot from the current 4.0 development branch of Solr. Right, now thats cleared up I suspect 90% of the readers will leave the page to go and read something else ?

So Solr 4 works just great. The applications using Sparse Map, like Sakai OAE , have a high update rate and are adding to the index continuously. The bundle queues updates and processes them via a single threaded queue reader into the index which is configured to accept soft commits and perform periodic flushes to disk. The Solr instance is unmodified from the standard Solr 4 snapshot and we have had no problems with it. Provided the cadinality of the fields that the application indexes are not insane, and the queries are also achievable there are no performance issues with queries being the sub 10ms that we have all become accustomed to from Solr. Obviously if you do stupid things you can make a query in Solr take seconds.

There are however some issues with the way the bundle works and certainly when deployed into production into a real cluster there are issues. No one would seriously run the Sparse Map with this Solr bundle on a single app server node for anything other than development or testing, so the default Embedded Solr configuration is a distraction. If your not writing code with the intention of deploying into production, then why write the code? Life is to short, unless your an academic on track to a Nobel prize. When deployed, the bundle connects to a remote Solr master for indexing with one or more Solr slaves hanging off the master (polling not being pushed to). There are several problems with this configuration. If the master goes down, no index updates can happen. This doesn’t break the Solr bundle since it queues and recovers from master failure with a local write ahead transaction log or queue. It does break the indexed data on the master since anything in memory on the master will be lost, and only those segments on disk will get propagated to the Solr slaves when the master recovers. This is a rock and a hard place. 1s commits with propagation cause unsustainable segment traffic with high segment merge activity. Infrequent commits will just loose data and destroy data propagation rates. The slaves, being read only are expendable provided there are always enough to service the load. Thats sounds like the definition of a slave, I would not like to be one, but then I wouldn’t know if I was.

Solr, in this configuration, wasn’t really designed for this type of load. If we indexing new documents at the rate of 1 batch an hour then Solr in this configuration would be prefect. However the updates can come through at thousands per second. So although it works, its fine, but when it breaks it will break and leave the index in some unknown state. The problem is rooted in how the indexing is done and where the write ahead log or queue is stored. Its fine for a single instance since the write ahead log is local to the embeded Solr instance but no good for a cluster.

Other approaches

There are lots of ways to solve this problem. It was solved in Sakai 2 (CLE) search which treated segments as immutable and sent them to a central location for distribution to each app server. Writers on each app server wrote to local indexes and on commit the segment was pushed to a central location where the segment was pushed to all other app server nodes. The implementation was less than perfect and there were all sorts of timing issues especially when it came to merging and optimising. That code was written in 2006 on a very old version of Lucene (1.9.1 IIRC). So old it didn’t have commit, let alone soft commits and it was only used for relatively slow rates of update supporting non critical user functionality. Its in production many Sakai 2 schools. Every now and again a segment gets corrupted and that corruption propagates slowly over the whole cluster with each merge and optimise. Eventually full index rebuilds are needed which can be carried out when in full production but are best done overnight when the levels of concurrency are lower.

At the time we had considered using the DB based IndexReaders and IndexWriters from the Compass project. These were readers and writers that used a DB BLOB as the backing store. Lucene depends heavily on seek performance, and doing seek over a network into the DB blob, doesn’t work. The IO required to retrieve sections of the segments to pull terms is so high that search speed is a bit low (British understatement, stiff upper lip and all that). After tests those drivers were rejected for the Sakai 2 work. It might have worked on an Oracle DB where seeks in blobs is supported and you can do some local caching, but on MySQL it was a non stater.

The next approach is that used by Jackrabbit. The Lucene index is embedded in the repo. Every repo has a local index with updates being written directly to all index sychronised across the cluster. Works well on one app node, but suffers in a cluster since ever modification to the local index has to be serialised over the entire cluster. Depending on the implementation of that synchronisation it can make the whole cluster serialized on update. Thats ok if the use case is mostly read as it is with the Enterprise Content Management use case, but in a Social Content Repository the use case is much higher update. App servers cant wait in a queue to get a lock on the virtual cluster wide index before making their insert and inserting a pointer into a list to tell all others their done.

Since 2006 the world has not stood still and there have been lots of people looking at this space. LinkedIn opensources Zoie and Bobo that deliver batched updated into distributed indexes and then build faceted search from those indexes. Although these would work for a Social Content Repository my feeling was the quality of data service (time it takes from a content item update to the index presence) was too high and required lots of discipline in the coding of the application to ensure that data local to the user was published directly to the content system rather than discovered via the search index. The area of immediate impact of data for LinkedIn is well defined, the users view of their profile etc so that QoDS can be higher than where an update might have to instantly propagate to 100s of users. The types of use cases I was targetting with the Sparse were more like Google+ where groups take a greater prominence. Except in Education, the group interaction is real time which pushed the QoDS down into the second or sub second range. So Zoie was ground breaking, but not enough. The work on this application, now Sakai OAE, started in 2008 when there was nothing else (visible) around. We started with SLing based on Jackrabbit and use its search capabilities, until we realised that a Social Content Repository has to support wide shallow hierarchies with high levels of concurrent update the Enterprise Content Management model is deep narrow hierarchies with lower levels of concurrent update. See this for detail

Roll forwards to 2010 when we pulled in Solr 4 which was just about to get the NRT patches applied. It looked, bar the small issue of cluster reliability that it was an Ok choice. And now were up to date 2012 and the world of distributed search has moved on and I want to solve the major blocker of reliability. I don’t want to have to write a distributed index as I did for Sakai 2, partly because there are many others out there doing the same thing better than I have time to. I could use SolrCloud, although IIUC that deals with the cloud deployment of Shards of SolrSlaves rather than addressing the reliability of high volume updates to those shards.

Terms, Documents or Segments

What to shard and replicate. The ability to shard will ensure scalability in the index, which turns the throughput model from a task compute farm into a parallel machine using the simplest of gather scatter algorithms (my PhD and early research was numerical parallel solutions on early MPP hardware, we always looked down on gather scatter since if never worked for highly interconnected and dynamic problem sets, sorry if thats offensive to MapReduce aficionados, btw gather scatter is the right algorithm here). The ability to replicate, many times, will ensure that we don’t have to thing about making hardware resilient. But what to shard and replicate. The Compass IndexReader and IndexWriter DB implementation proved that inverted indexes need high seek speeds to minimise the cost of scanning segments for terms. Putting latency between the workings of the inverted index and its storage was always going to slow an index down and even if you made segment and terms local to processing, processing queries on partial documents (shards of terms) creates imbalance in the processing load of a parallel machine and dependence on the queries. The reason for less than perfect parallel speedup on numerical problems in 1990 was almost always due to imperfect load balance in the algorithm. Pausing the solution for a moment to wait for other machines to finish is a simple bottleneck. Even if sharding and replication of partial documents or terms balances over the cluster of search machines, the IO to perform anything but the simplest query is probably going to dominate.

So I need an index implementation that shards and replicates documents. Its 2012 and a lot has happend. The author of Compass Shay Banon (@kimchy) went on to write ElasticSearch with a bunch of other veterans. It looks stable and has considerable uptake with drivers for most languages. It abandons the store segments centrally model of Compass and Sakai 2 and replicates the indexing operation so that documents are shaded and replicated. Transporting a segment over the network after a merge operation, as Solr Master/Slave does is time consuming, especially if you have everything in a single core and you merged segment set have become many GB in size. This looks like a prime contender for replacing the search capability since its simple to run, self configuring and discovering and ticks all the boxes as far as scaling, reliability and ease of use.

Another contender is Lucandra. Initially this was just Lucene on top of Cassandara. It implemented the IndexReader and IndexWriter inside Cassandra without segments eliminating the need to manage segments but also loosing most of the optimisations of memory mapped data. Unlike the Compass IndexReader and IndexWriter that wrote segments to DB blobs the structure of the index is columns and rows inside Cassandra. Not dissimular from the Sparse Map Cassandra driver that indexes by writing its own inverted index as it goes. There are some performance gains since if you put the Lucandra class into the Cassandra JVM the data is supposedly local, however Cassandra data is replicated and shaded so there is still significant IO between the nodes and the solution may benefit from Cassandras ability to cache, but will still suffer from the same problems that all term based or partial document sharding suffers from. Poor performance due to IO. When Lucandra became Solandra a year later in the authors reported the performance issues, but also reported a switch to sharding by document.

There will be more out there, but these examples show that the close source community implementing large distributed indexes on a document based shard and replicate approach is the right one to follow. (Hmm isn’t that what the 1998 paper from some upstarts titled “The Anatomy of a Large-Scale HypertextualWeb Search Engine” said ? The authors of Solandra admit that it still looses many of the optimisations of the segment but rightly point out if your deploying infrastructure to manage millions of small independent indexes then the file system storage issue become problematic which is where the management of storage by Cassandra becomes an advantage. As of September 2011 I get the impression that ElasticSearch is more mature than Solandra, and although everyone itches these days to play with a new tool in production (like a column DB) and throw away the old and reliable file system, I am not convinced that I want to move just yet. Old and reliable is good, sexy and new always gets me into trouble.

I think, I am going to deprecate the Solr bundle used for indexing content in Sparse Map and write a new bundle targeting ElasticSearch. It will be simpler, since I can use the write ahead transaction log already inside elastic search, its already real time (1s latency to commits and faster than that for non flushed indexes). I have also found references to it supporting bitmap bloom filter fields which means I can now embed much larger scale ACL reader indexing within the index itself. A post to follow on that later. Watch this space.