Deploying/Tuning SparseMap Correctly

26 02 2012


SparseMap is designed to maintain a shared cache of recently accessed maps in memory. The code base itself is also designed to use a little memory as possible. The SparseMap app server runs happily at load in 20MB of heap. Sakai OAE which is the main user of SparseMap uses a little more than that (around 200MB) leaving the remainder of heap available for caching. If caching is working correctly, and there is sufficient heap available for the number of active users the profile of calls to the storage layer should show almost no reads and a low level of writes. If however a mistake is made then the impact is dramatic. The first trace here shows Sakai OAE as of 2nd Feb 2012 running SparseMap 1.3 with a missconfigured cache setup. The image shows the SQL report from Neoload.

You can see that there a colossal number of SQL statements performing a query on parent hash and there are also massive number of other queries. Obviously something is not right.

Compare that with Sakai OAE on 23 Feb 2012 running SparseMap 1.5 with caching configured

The query profile has completely changed with almost everything being served from cache in this test. The 282189 queries taking 577s for parenthash has become 325 queries taking 0.645s The message here is, dont deploy SparseMap without caching enabled, and check that it is enabled and sized correctly. There are periodic log statements coming from SparseMap will indicate the performance of the cache which should always be running at over 80% hit rate.


SparseMap comes with a default configuration for SQL and DDL. It may be perfectly OK for most installations never needing any tuning, but the design and implementation assumed that deployers would tune both the DDL and SQL.

Tuning DDL

The DDL that comes with the RDBMS drivers is a default SQL schema. It makes the assumption that the deployment is going to be small to medium in size and probably never see more than 1M content items. If after sizing a production deployment its clear that the application will contain more than 1M items then some tuning of this DDL must be done. How much depends on how big the installation will be. The internal structure of SparseMap was designed to use database shards in the same way that YouTube’s metdata store does. The sharding is performed on the first 2 characters of the row key giving a theoretical maximum number of shards of 64^^2, although the configuration file will become unmanageable with that many shards.

Even if sharding is not required, the indexing operations within SparseMap will need tuning. If SparseMap should only be configured to index the column values that the application needs to index. By default there is only a single very wide indexing table which can become extremely inefficient. Its columns are by default large varchars and for many situations this will be very slow and wasteful. Once the client application knows what its indexing tables need to look like it should create those tables before SparseMap starts up so that each (yes there supposed to be more than one index table) table is dense and efficient with  properties that are queries and indexed by the same use case living in the same table. If the application has chosen to use non real time querying (eg by using Solr), then it should ensure that SparseMap is not unnecessarily indexing data that it will never use.

Tuning SQL

One of the main requirements for SparseMap was that it would allow UI devs to store and query any unstructured data without having to write a line of Java code. Consequently the queries it generates are not the most efficient. The design always assumed that deployers would at some time need to tune the SQL the application was hitting the database with, and they would also want to do that without touching a line of Java code. All the SQL is in property files to allow deployers to tune in production.

To assist a deployer in tuning, users of SparseMap should name all their queries by adding setting the property “_statementset” on each query to a different name. If that is done, then the deployer can bind customised tuned queries to each value of _statementset. The binding of queries also takes account of sharding if the storage layer has been sharded.

This only gives an introduction to deploying and tuning SparseMap. I would be astounded if SparseMap was so elegant that it could be deployed into production without ensuring caching was configured correctly, the DDL was adjusted and the SQL was tuned were appropriate.

For the observant amongst you the NeoLoad SQL reports have revealed a rather obvious bug that needs attention. SQL connections are bound to threads not to the pooled storage client implementation. When the storage client implementation is borrowed from the pool, the SQL connection associated with the thread performing the borrow operation is validated, in this case, Oracle, select 1 from dual is executed. Since the connection is thread bound this is unnecessary and accounts 496s or the 509s consumed by the load test. I intend to remove most of that in the next release as the validation approach is incorrect, and does not protect against SQL failure mid request. Correctly configured caching did, thankfully, reduce the SQL portion of this load test from 1064s down to 509s, although I think it should be possible to reduce that to arround 120s. At which point the load test will need to be upgraded.



Update: 29 Feb 2012

The high levels of select 1 from dual have been eliminated. Duffy Gillman from rSmart did the bulk of the work a few months ago in SparseMap 1.5 however he/we/I failed to remove all the places where connections were verified unnecessarily. The issue was fixed with commit  778a0bfe97963dccf46566a431853bab6f7c87cc which is available in the master branch of SparseMap for Sakai OAE to merge into their fork of the codebase. Hopefully, the improvement will show up in the next round of load testing, and should also remove JDBC as the top hotspot.

OpenID HTML and HTMLYadis Discovery parser for OpenID4Java

24 02 2012

OpenID4Java is great library for doing OpenID and OAuth. Step2 will probably be better but its not released. Unfortunately the HTML and HTMLYadis parsers rely on parsing the full HTML document and pull in a large number of libraries. These include things like Xerces and Resolver wich can cause problems if running multiple versions in the same JVM under OSGi. For anyone else wanting to eliminate dependencies here are Regex based parsers that have no dependencies outside code OpenID4Java and JRE.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.openid4java.discovery.yadis.YadisException;
import org.openid4java.discovery.yadis.YadisHtmlParser;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HTMLYadisDiscoveryParser implements YadisHtmlParser {

    private static final Logger LOGGER = LoggerFactory.getLogger(HTMLYadisDiscoveryParser.class);
    public String getHtmlMeta(String input) throws YadisException {
        Pattern head = Pattern.compile("\\<head.*?\\</head\\>",Pattern.CASE_INSENSITIVE | Pattern.DOTALL );
        Pattern meta = Pattern.compile("\\<meta.*?http-equiv=\"X-XRDS-Location\".*?\\>", Pattern.CASE_INSENSITIVE| Pattern.DOTALL);
        Pattern url = Pattern.compile("content=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Matcher headMatch = head.matcher(input);
        if ( headMatch.find() ) {
            Matcher metaMatcher = meta.matcher(;
            while( metaMatcher.find()) {                
                Matcher urlMatcher = url.matcher(;
                if ( urlMatcher.find() ) {
        } else {
  "No head found in {} ", input);
        return null;


import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang.StringUtils;
import org.openid4java.OpenIDException;
import org.openid4java.discovery.DiscoveryException;
import org.openid4java.discovery.html.HtmlParser;
import org.openid4java.discovery.html.HtmlResult; 
 public class HTMLDiscoveryParser implements HtmlParser {
   public void parseHtml(String htmlData, HtmlResult result) throws DiscoveryException {
        Pattern head = Pattern.compile("\\<head.*?\\</head\\>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Pattern link = Pattern.compile("\\<link.*?\\>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Pattern linkRel = Pattern.compile("rel=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Pattern linkHref = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Matcher headMatch = head.matcher(htmlData);
        if (headMatch.find()) {
            Matcher linkMatcher = link.matcher(;
            while (linkMatcher.find()) {
                String linkTag =;
                Matcher linkRelMatch = linkRel.matcher(linkTag);
                if (linkRelMatch.find()) {
                    Matcher linkHrefMatcher = linkHref.matcher(linkTag);
                    if (linkHrefMatcher.find()) {
                        String href =;
                        Set<String> terms = ImmutableSet.copyOf(StringUtils.split(, " "));
                        if (terms.contains("openid.server")) {
                            if (result.getOP1Endpoint() != null) {
                                throw new DiscoveryException("More than one openid.server entries found",
                        if (terms.contains("openid.delegate")) {
                            if (result.getDelegate1() != null) {
                                throw new DiscoveryException("More than one openid.delegate entries found",
                        if (terms.contains("openid2.provider")) {
                            if (result.getOP2Endpoint() != null) {
                                throw new DiscoveryException("More than one openid.server entries found",
                        if (terms.contains("openid2.local_id")) {
                            if (result.getDelegate2() != null) {
                                throw new DiscoveryException("More than one openid2.local_id entries found",

Monitoring SparseMap

24 02 2012

Yesterday I started to add monitoring to sparsemap to show what impact client operations were having and to give users some feedback on why/if certain actions were slow. There is now a service StatsService, with an implementation StatsServiceImpl that collects counts and timings for storage layer calls and api calls. Storage layer calls are generally expensive. If SQL storage is used that means a network operation. If a column DB storage layer is used, it may mean a network operation, but its certainly more expensive than not. Storage calls also indicate a failure in the caching. The StatsServiceImpl can also be configured in OSGi to report the usage of individual Sparse Sessions when the session is logged out. Normally at the end of a request. So using long term application stats should tell a user of SparseMap why certain areas are going slow, what storage operations need attention and if using SQL give the DBA an instant indication of where the SQL needs to be tuned. The short term session stats should enable users of sparsemap to tune their usage to avoid thousands of storage layer calls and eliminate unnecessary API calls.  Currently output is to the log file. Soon there will be some JMX beans and maybe some form of streamed output of stats. The log output is as follows.

app Key Slow Queries, T, N, Q 
app Slow Queries, 55, 1, cn:index_update:update cn_css_w set created = ? where rid = ?
app Slow Queries, 121, 1, ac:select:select b from ac_css_b where rid = ?
app Slow Queries, 272, 3, cn:select:select b from cn_css_b where rid = ?
app Slow Queries, 212, 3, cn:update:update cn_css_b set b = ? where rid = ?
app, Key Counters, ncalls, login, loginFailed, logout, slowOp, op, api
app, Counters, 2000, 2000, 0, 2000, 0, 4005, 0
app Key Storage, T, N, OP
app Storage, 2529, 2057, cn:select
app Storage, 1241, 1023, cn:index_update
app Storage, 888, 1004, cn:update
app Storage, 11, 18, cn:insert
app Storage, 21, 32, ac:select
app Storage, 10, 6, au:select
app Storage, 4224, 4001, n/a:validate
app Key Api Call, T, N, M
app Api Call, 11699, 999, org.sakaiproject.nakamura.lite.content.ContentManagerImpl.update
app Api Call, 217, 995, org.sakaiproject.nakamura.lite.accesscontrol.AuthenticatorImpl.signContentToken
app Api Call, 11, 1, org.sakaiproject.nakamura.lite.content.ContentManagerImpl.getInputStream
app Api Call, 691, 6061, org.sakaiproject.nakamura.lite.accesscontrol.AccessControlManagerImpl.signContentToken
app Api Call, 122, 2, org.sakaiproject.nakamura.lite.content.ContentManagerImpl.writeBody
app Api Call, 1250, 5059, org.sakaiproject.nakamura.lite.content.ContentManagerImpl.get
Output is in CSV form in the logs so the logs can be processed later the column headings have the following meaning.

T total time in ms
N number of times
M method
OP operation
Q query.

OP  is parsed as columnFamily:operation so cn:select means a select on the Content columnFamily.

For the DBA the Slow Queries are the ones to watch.

In this instance, at some point during the lifetime of this JVM

select b from ac_css_b where rid = ?

took 121 ms to run but it happend only once. (The laptop hung while performing a Time Machine backup)

This feature will be in version 1.6

SparseMap Content 1.5 released

19 02 2012

Sparse Map version 1.5 has been tagged (org.sakaiproject.nakamura.core-1.5) and released. Downloads of the source tree in Zip and TarGZ form are available from GitHub.

In this release 14 issues were addressed, the details are in the issue tracker.  This release contains critical fixes to the 1.3 and 1.3 release that make those releases unusable except for the smallest deployments. As with precious releases the SPI bundles for Derby and PostgresSQL have been released, however MySQL and Oracle SPI implementations have not been released in binary forms due to the licenses on the JDBC Drivers for those databases. There are tags in the source repositories for these releases. Again, I have also restrained from releasing the SPI implementations for Cassandra, HBase and MongoDB as I am not satisfied the implementations are sufficiently tested or complete.

If you find any issues, please mention them to me or, better still, add an issue to the issue tracker. Unless otherwise stated the license is Apache 2. Thanks to everyone who made this release possible.

Derby SPI Tag: PostgreSQL SPI Tag: MySQL SPI Tag: Oracle SPI Tag: 
Issues Fixed:

To use


The Jar is an OSGi bundle complete with Manifest, bundled with services. To use you will need to select a SPI implementation fragment bundle and deploy that with the core bundle. Normally this is done when the OSGi Standalone application jar is constructed. In addition to the core SparseMap bundle you will now need one of the the SPI implementation fragments.





MySQL, Oracle

Please check out the source code at the appropriate tag and build locally.


Search ACLs Part 2: Simple is always best

14 02 2012

I can’t take any credit for what I say next, all that goes to Mark Triggs who I spent some time chatting with last week. Simple is always best, and although a Bloom filter might be the CS way of solving membership resolution, its not the way to do it in an inverted index where we are really looking for the intersection between two sets. We already have lists of principals that can read each document associated with each document as a multivalued key word. This list is created by analysing the fully resolved ACL for each content item at the point of indexing. In general, content items dont have crazy numbers of ACLs so although the number of unique sets of read principals may be high, the number of read principals is not crazy high, so in that form cardinality etc are not overloaded.

The problem we encounter is that the simple query causes many query operations inside Solr so as the number of principals a user has rises the time taken for each of their queries also increases, into seconds. Not exactly much hope of web scale with that, and eventually the number of terms is too great for the Solr query parser, although by then I would expect the query to be taking minutes.

The solution, like all good things, is simple. Encode the read principals as a parseable token in the query (eg comma separated) and then use a custom query component compiles and caches a bitmap for the read principals, which are sorted. That part of the operation can be slow, for 1M documents, 100 readers per document, 10,000 possible readers if a user has 1000 principals Mark reports it taks order 250ms to build the bitmap first time from about 10 segments. Segments are immutable so as the current segment changes only parts of the bitmap need to be rebuilt. With 2M documents, 2000 principals per user it takes about 600ms to build the bitmap, again each segment is taking order 10ms to build the bitmap dependent on the size of the segment.

Once the bitmap is built and cached, subsequent queries are dominated by the other terms in the query. This looks like a solution that can be retrofitted to the existing index and should not require excessive load cache space or additional load, providing we cache with a key of segment/sorted-list-of-principals and are able to invalidate segment bitmaps as the segments get updated. In our case there are 2 classes of principal list. The anon user and logged in users. If we include the userID of the logged in user then we will probably get one cached bitmap per user. If we look for patterns in the lists of principals we can share the bitmaps more widely.

Mark can be found at, and the code at

Access Control Lists in Solr/Lucene

8 02 2012

This isn’t so much about access control lists in Solr or Lucene but more about access control lists in an inverted index in general. The problem is as follows. We have a large set of data that is access controlled. The access control is managed by users and they can individual items closed or open or anywhere between. The access control lists on the content, which may be files, or simply bundles of metadata is of the form 2 bitmaps, representing the permissions granted and denied, each pair of bitmaps being associated with a principal and the set of principal/bitmap pairs associated with each content item. A side complication is that the content is organised hierarchically and permissions for any one user inherit following the hierarchy back to the root of the tree. Users have many principals through membership of groups, through directly granted static principals and through dynamically acquired principals. All of this is implemented outside of the Solr in a content system. Its Solr’s task to index the content in such a way that a query on the content for an item is efficient and returns a dense result set that can have the one or two content items that the user can’t read, filtered out before the user gets to see the list. Ie we can tolerate a few items the user can’t read being found by a Solr query, but we cant tolerate most being unreadable. In the ACL bitmaps, we are only interested in a the read permission.

The approach I took to date was to look at each content item or set of metadata when its updated, calculate a set of all principals that can read the item and add those principals as a multivalued keyword property of the Solr document. The query, performed by a user computes the principals that the user has at the time they are making the query, and builds a Solr query that gets any document matching the query and with a reading principal in that set. Where the use of principals is moderate, this works well and does not overload either the cardinality of the inverted index where the reader principals are stored in Solr or the size of the Solr query. In these cases the query can be serviced as any other low cardinality query would be, by looking up and accumulating the bitmap representing the full set of documents for each reader principal in turn. The query then requires n lookups and accumulate operations, where n is the number of principals the user has, to resolve the permissions part of the query.

However, and this is the reason for this post, where this fails is where the cardinality of the reader principals becomes to high, or the number of principals that a user has is too high. Unfortunately those two metrics are connected. The more principals there are in a system, the more a user will need to access information, and so the reader principal mechanism can begin to break down. The alternative is just as unpleasant, where the user only has a single principal, their own. In those scenarios active management of ACLs in the content system becomes unscalable both in compute and human terms, which is why principals representing groups were introduced in the first place. Even if there were not limits to the size of a Solr query the cost of processing 1024 terms is prohibitive for anything other than offline processing.

Example of a Bloom filter

Image via Wikipedia

One solution that has been suggested is to use a Bloom filter to represent the set of principals that the user has and test each indexed principal against this filter. If this is done as part of the query, as the result set is being created there is no gain over scanning all documents since the inverted index would not be used. There could be some benefit in using this approach once a potential set of documents is generated and sorted, since the cost of performing sufficient hashes to fill the appropriate set of bloom buckets is low enough that it could be used as a post query filter. I think the term in Solr is a Collector. In this scenario we are already verifying the user can read a content item or its metadata before delivering to the user, hence its acceptable to have a less that perfect set of pointers being emitted from Solr, provided that the set of pointers we retrieve is dense. We can’t afford to have a situation where the set of pointers is sparse, say several million items, and the only item the user can read is the last one. In that scenario any permissions checking performed without the benefit of Solr would be super expensive.

So, the Bloom filter applied within Solr has the potential to be able to filter most result sets rapidly enough to create a result set that is dense enough for more thorough checking. How dense does it need to be and how large does the bloom filter need to be ? That is an open-ended question, however if, on average you had to read 20% more elements than you returned that might not be excessive if results sets were generally limited to no more than the first 100 items. If that’s the case then 80% density is enough. Bloom provides a guarantee of no false negatives but a probability of a % of false positives, ie items the a Bloom filter indicates are present in a set, but which, are not. For classical Bloom filters 20% is an extremely high probability of false positive, certainly not acceptable for most applications. It has been reported in a number of places that the quality of the hash function used to fill the buckets of the filter is also of importance in the number of false positives since an uneven distribution of hashes over the number space represented by the Bloom bitmap will result in inefficient usage of that bitmap. Rather than doing the math which you will find on Wikipedia, knowing that all fast hashes are less than perfect, and being a pragmatist I did an experiment. In Apache Hadoop there are several implementations of the Bloom filter with 2 evenly distributed and efficient hash functions, Jenkins and Murmur2, so I have used that implementation. What I am interested in is how big a filter would I need to get 80% density to a set of results and how big would that bitmap need to be as the number of inputs to the bloom filter (the users principals) rises. It turns out, that very small bitmap lengths will give sufficient density where the number of input terms is small, even if the number of tested principal readers is high. So 32 bytes of Bloom filter is plenty large enough to test with < 20 principals. Unfortunately however, the cardinality of these bitmaps is too high to be a keyword in an inverted index. For example, if the system contained 256K principals, and we expected users on average to have no more than 64 principals we would need a bloom filter of no more than 256 bits to generate, on average 80% density. Since we are not attempting to index that bloom filter the cardinality of 2^^256 is not an issue. Had we tried to, we would almost certainly have generated an unusable inverted index. Also, that Bloom filter is constructed for each users query, we can dynamically scale it to suit the conditions at the time of the query (number of items in the system, and number of principals the user has). Real system with real users have more principals and sometimes users with more principals. A system with 1M principals that has on average 1024 principals per user will need a bloom filter containing about 8Kbits. Its certain that adding a 8Kbit token ( or a 1Kbyte[] ) as a single parameter to a Solr query circumvents the issue surrounding the number of terms we had previously, but it’s absolutely clear that the cardinality of 2^^8196 is going to be well beyond indexing, which means that the only way this will work is to post filter a potentially sparse set of results. That does avoid rebuilding the index.

From this small experiment I have some questions unanswered:

  • Will converting a potentially sparse set of results be quick enough, or will it just expose another DoS vector?
  • What will be the practical cost performing 100 (principals) x 20 (I was using 20 hashes) into an 8kbit filter to filter out each returned doc item?
  • Will the processing of queries this way present a DOS vector?

Deprecate Solr Bundle

2 02 2012

Before that scares the hell out of anyone using Solr, the Solr bundle I am talking about is a small shim OSGi bundle that takes content from a Social Content Repository system called Sparse Map and indexes the content using Solr, either embedded  or as a remote Solr cluster. The Solr used is a snapshot from the current 4.0 development branch of Solr. Right, now thats cleared up I suspect 90% of the readers will leave the page to go and read something else ?

So Solr 4 works just great. The applications using Sparse Map, like Sakai OAE , have a high update rate and are adding to the index continuously. The bundle queues updates and processes them via a single threaded queue reader into the index which is configured to accept soft commits and perform periodic flushes to disk. The Solr instance is unmodified from the standard Solr 4 snapshot and we have had no problems with it. Provided the cadinality of the fields that the application indexes are not insane, and the queries are also achievable there are no performance issues with queries being the sub 10ms that we have all become accustomed to from Solr. Obviously if you do stupid things you can make a query in Solr take seconds.

There are however some issues with the way the bundle works and certainly when deployed into production into a real cluster there are issues. No one would seriously run the Sparse Map with this Solr bundle on a single app server node for anything other than development or testing, so the default Embedded Solr configuration is a distraction. If your not writing code with the intention of deploying into production, then why write the code? Life is to short, unless your an academic on track to a Nobel prize. When deployed, the bundle connects to a remote Solr master for indexing with one or more Solr slaves hanging off the master (polling not being pushed to). There are several problems with this configuration. If the master goes down, no index updates can happen. This doesn’t break the Solr bundle since it queues and recovers from master failure with a local write ahead transaction log or queue. It does break the indexed data on the master since anything in memory on the master will be lost, and only those segments on disk will get propagated to the Solr slaves when the master recovers. This is a rock and a hard place. 1s commits with propagation cause unsustainable segment traffic with high segment merge activity. Infrequent commits will just loose data and destroy data propagation rates. The slaves, being read only are expendable provided there are always enough to service the load. Thats sounds like the definition of a slave, I would not like to be one, but then I wouldn’t know if I was.

Solr, in this configuration, wasn’t really designed for this type of load. If we indexing new documents at the rate of 1 batch an hour then Solr in this configuration would be prefect. However the updates can come through at thousands per second. So although it works, its fine, but when it breaks it will break and leave the index in some unknown state. The problem is rooted in how the indexing is done and where the write ahead log or queue is stored. Its fine for a single instance since the write ahead log is local to the embeded Solr instance but no good for a cluster.

Other approaches

There are lots of ways to solve this problem. It was solved in Sakai 2 (CLE) search which treated segments as immutable and sent them to a central location for distribution to each app server. Writers on each app server wrote to local indexes and on commit the segment was pushed to a central location where the segment was pushed to all other app server nodes. The implementation was less than perfect and there were all sorts of timing issues especially when it came to merging and optimising. That code was written in 2006 on a very old version of Lucene (1.9.1 IIRC). So old it didn’t have commit, let alone soft commits and it was only used for relatively slow rates of update supporting non critical user functionality. Its in production many Sakai 2 schools. Every now and again a segment gets corrupted and that corruption propagates slowly over the whole cluster with each merge and optimise. Eventually full index rebuilds are needed which can be carried out when in full production but are best done overnight when the levels of concurrency are lower.

At the time we had considered using the DB based IndexReaders and IndexWriters from the Compass project. These were readers and writers that used a DB BLOB as the backing store. Lucene depends heavily on seek performance, and doing seek over a network into the DB blob, doesn’t work. The IO required to retrieve sections of the segments to pull terms is so high that search speed is a bit low (British understatement, stiff upper lip and all that). After tests those drivers were rejected for the Sakai 2 work. It might have worked on an Oracle DB where seeks in blobs is supported and you can do some local caching, but on MySQL it was a non stater.

The next approach is that used by Jackrabbit. The Lucene index is embedded in the repo. Every repo has a local index with updates being written directly to all index sychronised across the cluster. Works well on one app node, but suffers in a cluster since ever modification to the local index has to be serialised over the entire cluster. Depending on the implementation of that synchronisation it can make the whole cluster serialized on update. Thats ok if the use case is mostly read as it is with the Enterprise Content Management use case, but in a Social Content Repository the use case is much higher update. App servers cant wait in a queue to get a lock on the virtual cluster wide index before making their insert and inserting a pointer into a list to tell all others their done.

Since 2006 the world has not stood still and there have been lots of people looking at this space. LinkedIn opensources Zoie and Bobo that deliver batched updated into distributed indexes and then build faceted search from those indexes. Although these would work for a Social Content Repository my feeling was the quality of data service (time it takes from a content item update to the index presence) was too high and required lots of discipline in the coding of the application to ensure that data local to the user was published directly to the content system rather than discovered via the search index. The area of immediate impact of data for LinkedIn is well defined, the users view of their profile etc so that QoDS can be higher than where an update might have to instantly propagate to 100s of users. The types of use cases I was targetting with the Sparse were more like Google+ where groups take a greater prominence. Except in Education, the group interaction is real time which pushed the QoDS down into the second or sub second range. So Zoie was ground breaking, but not enough. The work on this application, now Sakai OAE, started in 2008 when there was nothing else (visible) around. We started with SLing based on Jackrabbit and use its search capabilities, until we realised that a Social Content Repository has to support wide shallow hierarchies with high levels of concurrent update the Enterprise Content Management model is deep narrow hierarchies with lower levels of concurrent update. See this for detail

Roll forwards to 2010 when we pulled in Solr 4 which was just about to get the NRT patches applied. It looked, bar the small issue of cluster reliability that it was an Ok choice. And now were up to date 2012 and the world of distributed search has moved on and I want to solve the major blocker of reliability. I don’t want to have to write a distributed index as I did for Sakai 2, partly because there are many others out there doing the same thing better than I have time to. I could use SolrCloud, although IIUC that deals with the cloud deployment of Shards of SolrSlaves rather than addressing the reliability of high volume updates to those shards.

Terms, Documents or Segments

What to shard and replicate. The ability to shard will ensure scalability in the index, which turns the throughput model from a task compute farm into a parallel machine using the simplest of gather scatter algorithms (my PhD and early research was numerical parallel solutions on early MPP hardware, we always looked down on gather scatter since if never worked for highly interconnected and dynamic problem sets, sorry if thats offensive to MapReduce aficionados, btw gather scatter is the right algorithm here). The ability to replicate, many times, will ensure that we don’t have to thing about making hardware resilient. But what to shard and replicate. The Compass IndexReader and IndexWriter DB implementation proved that inverted indexes need high seek speeds to minimise the cost of scanning segments for terms. Putting latency between the workings of the inverted index and its storage was always going to slow an index down and even if you made segment and terms local to processing, processing queries on partial documents (shards of terms) creates imbalance in the processing load of a parallel machine and dependence on the queries. The reason for less than perfect parallel speedup on numerical problems in 1990 was almost always due to imperfect load balance in the algorithm. Pausing the solution for a moment to wait for other machines to finish is a simple bottleneck. Even if sharding and replication of partial documents or terms balances over the cluster of search machines, the IO to perform anything but the simplest query is probably going to dominate.

So I need an index implementation that shards and replicates documents. Its 2012 and a lot has happend. The author of Compass Shay Banon (@kimchy) went on to write ElasticSearch with a bunch of other veterans. It looks stable and has considerable uptake with drivers for most languages. It abandons the store segments centrally model of Compass and Sakai 2 and replicates the indexing operation so that documents are shaded and replicated. Transporting a segment over the network after a merge operation, as Solr Master/Slave does is time consuming, especially if you have everything in a single core and you merged segment set have become many GB in size. This looks like a prime contender for replacing the search capability since its simple to run, self configuring and discovering and ticks all the boxes as far as scaling, reliability and ease of use.

Another contender is Lucandra. Initially this was just Lucene on top of Cassandara. It implemented the IndexReader and IndexWriter inside Cassandra without segments eliminating the need to manage segments but also loosing most of the optimisations of memory mapped data. Unlike the Compass IndexReader and IndexWriter that wrote segments to DB blobs the structure of the index is columns and rows inside Cassandra. Not dissimular from the Sparse Map Cassandra driver that indexes by writing its own inverted index as it goes. There are some performance gains since if you put the Lucandra class into the Cassandra JVM the data is supposedly local, however Cassandra data is replicated and shaded so there is still significant IO between the nodes and the solution may benefit from Cassandras ability to cache, but will still suffer from the same problems that all term based or partial document sharding suffers from. Poor performance due to IO. When Lucandra became Solandra a year later in the authors reported the performance issues, but also reported a switch to sharding by document.

There will be more out there, but these examples show that the close source community implementing large distributed indexes on a document based shard and replicate approach is the right one to follow. (Hmm isn’t that what the 1998 paper from some upstarts titled “The Anatomy of a Large-Scale HypertextualWeb Search Engine” said ? The authors of Solandra admit that it still looses many of the optimisations of the segment but rightly point out if your deploying infrastructure to manage millions of small independent indexes then the file system storage issue become problematic which is where the management of storage by Cassandra becomes an advantage. As of September 2011 I get the impression that ElasticSearch is more mature than Solandra, and although everyone itches these days to play with a new tool in production (like a column DB) and throw away the old and reliable file system, I am not convinced that I want to move just yet. Old and reliable is good, sexy and new always gets me into trouble.

I think, I am going to deprecate the Solr bundle used for indexing content in Sparse Map and write a new bundle targeting ElasticSearch. It will be simpler, since I can use the write ahead transaction log already inside elastic search, its already real time (1s latency to commits and faster than that for non flushed indexes). I have also found references to it supporting bitmap bloom filter fields which means I can now embed much larger scale ACL reader indexing within the index itself. A post to follow on that later. Watch this space.