Indexing and Query Performance in SparseMap 1.0, indicative tests

21 09 2011

Since early late May early June, it became apparent that the the table based indexing approach in Sparse Map used by Sakai OAE had problems. Performing queries on large key value tables can work, provided those queries are simple in nature and the volume of records in the table is not excessive. In fact parts of WordPress’s ontology store use this approach. Unfortunately in Sakai OAE the key value table grows at 10-120 times the rate of the main content store which grows at 2x the number of content items. In addition to this the queries that need to be performed on this table are paged, sorted and distinct. Not surprisingly that generates a performance issue. It first became apparent in Derby where it was really obvious. So obvious that Sakai OAE 1.0 RC1 would grind to a halt after running in integration test suite on Derby. Those issues were fixed for the Derby driver with the 1.0 release of SparseMap on which Sakai OAE 1.0 was based. Unfortunately, further testing shows that all other databases are effected. I say all, I mean MySQL and PostgreSQL since I dont have access to an Oracle DB at the moment to test… but there is a high probability that it will also be effected. To show the impact, here are some comparative graphs. The first one shows Derby query performance at the OAE 1.0 RC1 tag. Fortunately this is not what was released since SparseMap 1.0 contains a fix for the problem. At the time the consensus was that the problem did not effect MySQL or PostgreSQL and so some extents thats true, however detailed testing shows that the problem effects MySQL and PostgreSQL and presumably Oracle.

So what does this really mean ? It means that any page that relies on a data feed that uses SQL queries will slow dramatically. These tests are only for 10K objects, which sounds like a lot, but is not. Each content item needs 2 objects. A single user might have 20 content items in their home space, so with 100 users there are 4K objects before anyone has logged in. Once those users log in they will create pages and upload content. A content item typically has 100 embeded items, so the reality a small pilot of 100 users will reach 100K items in a few weeks.

Ignore the Derby OAE 1.0 RC1 graph, its irrelevant, but look at the MySQL OAE 1.0 graph. The pattern for the OAE 1.0 is a 4x increase in query times per doubling of number of rows. At 100K items queries may take upto 1000ms to execute, which means that a page with 1 SQL based feed will take at least 1s to return. Most pages have many more data feeds than that. Since this post is based on evidence, I will stop there and leave it to the reader to extrapolate.

What can or should be done ?

Short term:

  • Don’t run Sparse Map in key value indexing mode on RDBMS back ends if you have more than a few columns to index or are asking it to perform direct and complex queries. Ie dont use SparseMap 1.0. Unfortunately non of the NoSQL drivers support complex direct queries and all the RDBMS drivers in SparseMap 1.0 suffer from the problem… except Derby which strangely might be a better production DB for OAE 1.0 than MySQL or PostgreSQL.

Long Term:

  • Dont, unnecessarily create massive content trees with deep hierarchies to represent simple objects. Unpacking a JSON tree of single objects will do just this. If there is no need to do do it then its wasteful and would be better stored as a single property resulting in 1 or 2 content items per conceptual object.
  • Don’t rely on indexing as a substitute for publishing data in the right form for 99% of the queries (ie write exactly what you want to read once, and read it billions of times). Although the MySQL query cache does a good job of caching query results, in a situation where there is a high write traffic, the cache gets invalidated regularly.
  • Dont think that inverted indexes will eliminate the problem. All have latency or cardinallity issues and most have clustering issues which is why anyone running at scale in a cluster always puts a quality of service or data delivery parameter on data being written to data appearing in the index. There is plenty of material on this subject on the devops blogs at places like LinkedIn or Facebook.




4 responses

23 09 2011
Harry Wang

Hi Ian,

I have always been interested in using NoSQL database for OAE in production. I asked a similar question before and got an impression that the best DB choice for OAE production is PostgreSQL (is that still true?). Have you done any of the tests shown here using NoSQL databases? As you may know, Erik just developed a driver for MongoDB (which I have not tried yet). Is using NoSQL db for OAE production an ultimate goal for SparseMap?



23 09 2011

Hi Harry, At the moment I dont think the NoSQL DBs have the necessary level or query support to support OAE. OAE needs multi query terms with paging and sorting on large sets and at the moment we dont have that capability in any of the NoSQL DBs. As you can see from the posts MySQL has problems with the length of a row and query performance and indexing, all of which I hope are solvable. That leaves Derby, PostgreSQL and Oracle as potential production DBs. Derby may not sound like a good production DB, but the tests indicate its quite happy at 200K records with wide columns so it should be considered for single instance pilots. Eventually, once we have written native implementations of the query engines it should be possible to use the NoSQL DBs, however do not forget that there are some massive systems out there that use RDBMS’s (YouTubes metadata system is sharded MySQL quite simular in structure to the non indexing part of SparseMap, bits of Facebooks message system are PostgreSQL and FlockDB, twitters followers DB is sharded MySQL). Its perfectly possible to use a RDBMS in a way that works for a wide cluster.

I am working with Erik Froese to get the MongoDB driver merged. It has some nice additional features like GridFS and I think it may have better support for native querying.


26 09 2011
Harry Wang

Thanks a lot for these very useful information!

4 10 2011
Erik Froese

The MongoDB driver supports native multi-term queries as well as the simple OR queries when the $orset key is found. It will also use native MongoDB sorting if the _sort key is present in the query.

Paging could be added with the skip() and limit() functions. I didn’t see sparse ever using it so there wasn’t a clear way to add it. If SMC ever supports those operations it should be pretty straightforward to support them.

%d bloggers like this: