Sparse Map Content Store

16 11 2010

While doing soak tests and endless failed release candidates for the Sakai3/Sakai OAE/Nakamura Q1 release I have been experimenting with a content store. This work was triggered by a desire to have support for storage of users, groups and content, with access control, that a) clustered and b) didn’t have concurrency issues. I also wanted something thin and simple so that I could spend more time on making it fast and concurrent than on adding lots of features. The original implementation was done in Python (btw, that’s my second Python app), in about 2 days on top of Cassandra, but the Sakai community is Java, so although that was really fast to implement and resulted in code that had all the concurrency required, it wasn’t any good for the bulk of the community. (Community over code.) The re-implementation in Java has taken several weeks part time, but along the way APIs have appeared, and what was hard-coded to work on Cassandra is now based on an abstract concept of a sparse map, or sparse array. Data is stored in rows, and each row contains a set of columns drawn from a potentially infinite set of columns, i.e. a column DB. All of the objects (users, groups and content) can be stored in that structure, and with modification-based operations on the underlying store the API is amenable to being backed by many kinds of storage. So far I have drivers for memory (trivial concurrent hash maps), JDBC (MySQL and Derby) and Cassandra. Content bodies are slightly problematic, since some of these stores are not amenable to streaming GBs of data (Cassandra and JDBC), so the driver also has a blocked storage abstraction that collects chunks of the body together into block sets. As currently configured, a block set is 64 1MB blocks in a single row, stored as byte[] columns. Where a streamable source is available (e.g. a file) there is a driver to stream content in and out directly from file.
The code base has minimal dependencies, the main one being Google Collections for its map implementations, so it should be possible to create a driver for BigTable and run inside App Engine.
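To make the sparse-map idea concrete, here is a minimal sketch: rows keyed by an identifier, each holding an open-ended set of named columns. The interface and names are mine, purely illustrative of the concept, not the project's actual API; the in-memory driver really can be little more than nested concurrent hash maps.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sparse-map store: each row is a map of column name -> value,
// drawn from an open-ended column set (the "column DB" model in the text).
interface SparseStore {
    Map<String, Object> get(String keySpace, String columnFamily, String key);
    void insert(String keySpace, String columnFamily, String key,
            Map<String, Object> columns);
}

// The trivial memory driver: nested concurrent hash maps, no locking.
class MemoryStore implements SparseStore {
    private final ConcurrentHashMap<String, Map<String, Object>> rows =
            new ConcurrentHashMap<>();

    private String rowKey(String keySpace, String columnFamily, String key) {
        return keySpace + ":" + columnFamily + ":" + key;
    }

    public Map<String, Object> get(String keySpace, String columnFamily,
            String key) {
        Map<String, Object> row = rows.get(rowKey(keySpace, columnFamily, key));
        // Return a copy so callers cannot mutate the stored row directly.
        return row == null ? new HashMap<>() : new HashMap<>(row);
    }

    public void insert(String keySpace, String columnFamily, String key,
            Map<String, Object> columns) {
        rows.computeIfAbsent(rowKey(keySpace, columnFamily, key),
                k -> new ConcurrentHashMap<>()).putAll(columns);
    }
}
```

A JDBC or Cassandra driver would implement the same interface against a table or column family, which is what keeps the thin upper layers storage-agnostic.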

All of this is vaguely interesting, but how does it perform?

With simple layers and no synchronisation, the memory-based driver is almost perfectly concurrent when tested on my MacBook Pro, provided the JVM can get the CPU cores. It creates users at about 20K users per second without any optimisation, which in itself would create some interesting opportunities to queue modifications asynchronously. It also reads and writes blocked content at about 1GB/s, which is probably just the memory bandwidth inside the JVM, indicating that the block abstraction is OK. Those trivial tests just confirm that the thin upper layers don’t introduce concurrency issues. The JDBC driver on MySQL peaks at around 100 user creations per second, with streaming at about 50MB/s write and 1GB/s read, the latter again probably straight from the memory cache. Concurrency efficiency is not high with MySQL, indicating some fundamental synchronisation in the JDBC driver or MySQL configuration, as expected. The Cassandra driver gives about 3K user creations per second, is concurrent up to the number of cores, writes in blocked mode at 13MB/s and reads at 70MB/s. Although the streaming speed seen in this mini viability test is not great, it looks like block storage of bulk content inside Cassandra is viable. Obviously all of this is a very rough test, with threads from multiple processes fighting each other for CPU cores, but it does show that:
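The blocked-body storage those streaming numbers exercise reduces to simple address arithmetic. A sketch, assuming the 64 × 1MB configuration described earlier (the class and field names are mine, not the driver's):

```java
// Maps a byte offset in a content body to its storage location, under the
// configuration described above: 1MB blocks, 64 blocks per row ("block set"),
// each block a byte[] column. Names are illustrative.
public class BlockAddress {
    static final int BLOCK_SIZE = 1024 * 1024; // 1MB per block
    static final int BLOCKS_PER_SET = 64;      // 64 blocks per row

    final long blockSet;     // which row holds this offset
    final int block;         // which byte[] column within that row
    final int offsetInBlock; // position within the block

    BlockAddress(long byteOffset) {
        long blockNo = byteOffset / BLOCK_SIZE;
        this.blockSet = blockNo / BLOCKS_PER_SET;
        this.block = (int) (blockNo % BLOCKS_PER_SET);
        this.offsetInBlock = (int) (byteOffset % BLOCK_SIZE);
    }
}
```

Because a whole 64MB block set lives in one row, sequential reads and writes touch the underlying store once per 64 blocks, which is why the abstraction can approach memory bandwidth on the memory driver.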

  • So far the implementation is concurrent, limited by the underlying storage tech.
  • A Sparse Map/Sparse Array abstraction for content storage works on Column, Memory and JDBC drivers with relative ease.
  • Performance looks adequate for my needs.
  • Implementing something like this in Python with test coverage is way quicker than in Java, but also less engineered.

Next steps:

  • Look at read- and write-through cache drivers, as all of the above has no caching enabled (other than the memory driver).
  • See if a Jackrabbit User Manager can be implemented over the top of this.
  • See if a Sling Resource Resolver can be implemented over the top of this.

Code is at but now that the Q1 release of Nakamura has gone out, I should really get back to the day job.

YouTube Snippets

3 11 2010

One of the perks of being a member of the University of Cambridge is that you can (and are actively encouraged to) attend lectures, in any department, on any subject. I think I am right in saying 99% are open to any member of the University. Every now and again the Computer Lab has a speaker worth listening to: Oliver Heckmann, Director of Engineering, Google Zurich, whose talk “A Look Into YouTube – The World’s Largest Video Site” was one of those, especially seeing as a few hours earlier Turkey had reimposed its ban on YouTube for what it claimed was unsuitable content, identified by Dr Heckmann’s content ID system. He was relaxed, unflustered by the robust stance Google Inc’s chief counsel was taking, reported minutes before by Reuters, to paraphrase, probably incorrectly, “….. censorship by any one country is an attack on US free trade… ”; non-US readers might be wondering about global free trade at this point.

Aside from the relaxed state of mind and the non-technical war of words raging over the North Atlantic, there were some interesting things that I believe are public; in fact I think the whole talk was public. YouTube is a Python app that still uses Apache httpd and a sharded, read-mostly MySQL backend for metadata and web content. The main reason behind this was speed of implementation, which, having done a prototype content system in Python on Cassandra, I can believe. Its heaviest service is the thumbnail service, which has 20x the requests of other services, and in the early days (cough, 2005 I think) some muppet put all the thumbnails on disk as individual files, soon overwhelming the inodes available on the filesystem. The talk even mentioned “…. in one folder ….”, but I don’t think I believe that, making machine recovery take many hours (I do believe that). That surprised me, since even the backups we do, which are inode based, overwhelm the OS and tools like rsync, as we found out before 2005. So all of that is in BigTable now, but I question why the thumbnails are not embedded in CSS files and streamed out as a slower-changing set to cover the extremely high-rate pages. That would be a 20x saving in that area. Perhaps they are; the talk was thinner on detail the closer to today it became. Video content is all streamed over HTTP from lighttpd, which uses a more event-based structure than Apache httpd, although I think Apache httpd may be changing. Why lighttpd? Because with long-lived HTTP connections streaming content, a few threads can service the transport of bytes to sockets without the need for lots of sophistication or the need to tie threads down to sockets. With that approach I will guess that the OS is tuned with more space allocated to socket maintenance, balancing the number of sockets to one thread per core shipping data out over the network cards.

The thought-provoking part of the talk was the approach to copyright management. The reassuring part was that, if what was presented isn’t a million miles from today, there isn’t a magical world where anything is possible behind the Google IPR wall, just bright engineers finding solutions to problems that they encounter, in an environment that makes that task a little easier. The last question from the audience, on how Google pays corporate taxes to host countries, made me smile. Dr Heckmann wisely denied all knowledge of that part of the business.

Version on Create Concurrency

2 11 2010

In Jackrabbit 2.1.1 (possibly fixed in later versions), if you create nodes with mix:versionable added to them, the version history will be created, which blocks other threads performing writes and persistence-manager-level reads. If the persistence manager is a JDBC-based one and other threads are attempting to find items that are not in the shared cache, reads will also be blocked, as they need to access the database. Remember the PM is a single-threaded transaction monitor in JR 2.1.1 and earlier. So creating an item with mix:versionable while many concurrent requests are performing the operation results in a thread trace as below (red == blocked threads).


Removing the mix:versionable appears to remove almost all thread blocking, as below. It may not look like it, but in the case below there are more concurrent requests running than in the example above, and the time per request is significantly less.


So add mix:versionable only when you need it, at the point of checkin.
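That advice, sketched against the standard JCR 2.0 API (this needs a running repository behind the Session, so it is illustrative rather than self-contained, and the node names and properties are made up):

```java
import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.version.VersionManager;

public class DeferredVersioning {
    public static void createThenCheckin(Session session, String parentPath,
            String name) throws Exception {
        Node parent = session.getNode(parentPath);
        // Create without mix:versionable: no version history is built yet,
        // so concurrent writers are not serialised behind its creation.
        Node node = parent.addNode(name, "nt:unstructured");
        node.setProperty("title", "example");
        session.save();

        // Later, at the point a version is actually wanted:
        node.addMixin("mix:versionable");
        session.save(); // version history is created here, once
        VersionManager vm = session.getWorkspace().getVersionManager();
        vm.checkin(node.getPath());
    }
}
```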