Given enough Rope

31 10 2010

A bit of background. We have been experiencing bad performance for certain searches in Jackrabbit using Lucene. I always knew that sorting on terms not in the Jackrabbit Lucene index was a bad thing, but never looked much further than that until now.

The query that was problematic is:

//*[@sling:resourceType='sakai/user-home'] order by  public/authprofile/basic/elements/lastName/@value

i.e. find all nodes with sling:resourceType = 'sakai/user-home' and sort the result by a property of a child node. On small result sets this appears reasonable, but what happens when the numbers scale up, and how do alternative queries compare? For example, no sorting at all:

//*[@sling:resourceType='sakai/user-home']

or sorting on a property of the target node itself:

//*[@sling:resourceType='sakai/user-home'] order by @lastName

So I loaded an instance with up to 6K user-home nodes, then loaded 6K non user-home nodes, and ran some experiments; the results are graphed below.


It is pretty clear that the search sorted on a child-node property does not scale at all and is not concurrent. Further investigation reveals that the sort operation has to load the child nodes inside the Lucene scorer directly from the JCR; with ACLs this is about 120K nodes. With more than 1K nodes the LRU caches inside Jackrabbit are overwhelmed and none of the fetches come from cache, which is the underlying cause of the concurrency problem. In JR 2.1.1 the Persistence Manager is synchronized, so loads are sequential within the JVM: four threads all have to wait for one another to load the same data from the DB. Why this code was ever sequential I can't quite see; presumably because a non-JDBC Persistence Manager might need to be synchronized.

All of that aside, don't even think about doing anything in a Jackrabbit search that references anything not on the target node; even at the smallest scale it won't perform, and don't be fooled by a test on 100 items. Anyone who is using Lucene directly will know that's obvious; it is one of the dangers of a powerful query abstraction. Give a developer enough rope to hang themselves with, and they will.


The solution in this case: publish the sort property onto the node being sorted so that it gets into the index, and its value can then be used to sort in a simple Lucene scorer that doesn't touch the JCR at all.
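
Roughly, the shape of that fix looks like the sketch below. This is not the actual Nakamura code; the flattened property name and the point at which it would be called are assumptions, but it shows the idea of copying the child value up onto the queried node so that the Lucene index carries the sort key.

import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

// Sketch only: copy the lastName value from the authprofile child node up onto
// the user-home node itself, so "order by @lastName" can be satisfied from the
// index alone. Property and path names are illustrative assumptions.
public class SortKeyPublisher {

    public void publishSortKey(Session session, String userHomePath) throws RepositoryException {
        Node home = session.getNode(userHomePath);
        String lastNamePath = "public/authprofile/basic/elements/lastName";
        if (home.hasNode(lastNamePath)) {
            Node lastName = home.getNode(lastNamePath);
            if (lastName.hasProperty("value")) {
                // hypothetical flattened property used as the sort key
                home.setProperty("lastName", lastName.getProperty("value").getString());
                session.save();
            }
        }
    }
}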






Ever wondered why Skype doesn't work at home?

29 10 2010

Have you got a router with DoS protection? Go and look at the logs, and if they show lots of denied UDP traffic, turn the DoS protection off and see what happens. Inbound Skype often uses UDP, and some routers think that the inbound traffic is a DoS attempt, blocking the packets and making the Skype audio sound like you are in a cave network at best. It also kills VPN/IPsec performance. Mine went from 5KB/s up to 600KB/s when I turned it off; after posting this I will make certain my IP changes.






Our code is like custard

29 10 2010

When you hit it, it becomes like concrete. Deadlocked like concrete. Oh dear. Having just cut an RC3 for our Q1 release, we elected to spin up a server and let friendly users have a play, confident that we had found all the contention issues. So confident, in fact, that since the JMeter test scripts had found everything, we used those scripts to populate the QA server with a modest number of users, groups and content. I knew something wasn't quite right as I loaded up the content and got complaints of slow response over IRC; come on, content upload being performed by 12 users with only 6K users loaded should not cause others to see really slow response (define really slow? minutes, no kidding). In a moment of desperation I took solace in the fact that even though some queries were taking minutes, the server was rock solid. (lol)

So with that background and my head firmly stuck in the sand, we went into the 1h bug bash to gently press the code base. At this moment I want you to visualise a large pool filled with custard, and about 30 people on the edge who all jump in at the same time. Without much movement, and to a bit of shock, the whole pool turns into a solid lump. Our code base is a non-Newtonian fluid, just like the custard in the swimming pool, and this blog post is stress relief before trying to find out where that deadlock is coming from; not what I want to be doing when I was supposed to be spending the weekend with family. If I find the cause, and it's not too embarrassing, I might post it here.






Wrong Content Model

17 10 2010

I knew there was something wrong. You know that gut feeling you have about something, when it just doesn't feel right, but you can't explain it coherently enough to persuade others, so eventually self-doubt creeps in and you go along with the crowd. We have a mantra, a borrowed phrase really, borrowed from JCR: "Content is everything". It's possible it was borrowed without knowing what it really meant. One of those emperor's new clothes things, except this time the emperor really was wearing clothes, and they were so fantastic that perhaps we thought they would fit our build, so we borrowed them, not quite realising what that meant.

One of the founding principles of Sakai 3 is that content should be shared everywhere. That expands to being re-used, repurposed and re-owned everywhere. To achieve that, there are two solutions: give content ownership and allow users who own the content to organise and manage that content how they see fit, including adding their own structure to the content if that's what they want to do; or alternatively, put all the content into a big pool and let users find it, tag it, apply ontologies to it and, crucially, express access rights at the individual item level. Neither approach is right, neither is wrong. The former has compromises when ownership is changed; the latter has compromises when each user needs to manage large volumes of content individually. I am not going to say which is better, having been burnt too many times trying to apply engineering logic to user experience design decisions, but one thing I do know is that the underlying implementation for one approach is very different from the underlying implementation for the other. Getting them the wrong way round is likely to lead to problems.

So the UX design process decided that the big pool approach was right. Often quoted in discussions was doing things like Google Docs: handing round pointers to documents identified by a key at the end of each URL, opaque in meaning to the end user but immensely scalable; expressing access rights as "share this item with my friends", "make this public", or "I'm happy for anyone who has this URL to edit this". In that there was no expression of where the document lived, no organisation of the document into containers and certainly no management of access on containers, although it is interesting to see that the limitation of managing large volumes of documents in a flat list has led Google Docs to introduce folders, where like-minded documents get shared with collaborators by virtue of their location. The content model is scalable on two levels. On a technical level, it's easy to generate billions of non-conflicting keys per second, and it's easy to shard and replicate the content associated with those keys on a global scale, migrating information to where it's needed and avoiding the finite limitations of the speed of light and routers. On a human level, the machine-generated key and per-user hierarchy eliminates all conflicts. Google would be bust if it offered help desk resolution of conflicting document URLs where two users demanded to share the same web address for their document. By allocating Gmail usernames on a first come, first served basis, Google managed to get tacit acceptance of a given naming scheme without help desk load. How do you persuade Google to give you the userid of another user, just because you believe you have a right to it and they got there first? Dream on.

We chose a content technology because we wanted something that worked really well and covered the generalised use case of content management. The content system we have is hierarchical and maps URLs directly to content paths, exploiting hierarchical structure to make it easier to manage large volumes of data. It hits the sweet spot of enterprise content management, but is it right for us? In Sakai 3, UX design has chosen a pooled content model for all the same use cases as covered by Google Docs, but we are building it on a content system that requires agreement on URL space, agreement on locations within the content system, and critically uses that hierarchy to drive efficiencies on many levels. Hierarchy is fundamental to the Java Content Repository specification, fundamental to Jackrabbit's implementation, and to some extent a natural part of the HTTP specification. Attempting to layer a pooled content system over a fundamentally hierarchical storage system is probably a recipe for disaster on two levels. Technically it can be done, we have done it, but as my gut tells me we are beginning to find out, it wasn't a good idea. All those efficiencies that were core to the content repository model are gone, exposing some potentially quite expensive consequences. At a human level we have side-stepped the arguments over who owns what URL in a collaborative environment by obfuscating the URL, but in the process we have snatched back from the user the ability to organise their content by hierarchy. The help desk operations that support Sakai 3 won't go bust processing URL conflict resolution, since users don't get URLs to conflict over.

What should we do? I think we should admit that the models are separate and not try to abuse one user experience with the wrong supporting implementation, or conversely force an implementation to do what it was not designed for. We have a crossover and we have to choose. For want of better words, we have to choose a content management user experience supported by content management storage, living true to "everything is content" as embodied by David's model, or a pooled user experience supported by UUID-based object storage intended for massive scale. Mixing the two is not an option.

Hindsight is a wonderful thing to learn from; mistakes made? Yes, we mixed up a choice on a technical level with political desires, forgetting that in a design-led development process, technical decisions must be made purely to serve the design. If the political desires were important, they should have adjusted the design process from the start. I live and learn.





Jackrabbit Performance

15 10 2010

Having spent a month or so gradually improving performance in Nakamura, I thought I should share some of the progress. First off, Nakamura uses Sling, which uses Jackrabbit as the JCR, but Nakamura probably abuses many of the JCR concepts, so what I've found isn't representative of normal Jackrabbit usage. We do a lot of ACL manipulation, we have lots of editors of content, with lots of denies, and we have groups that are dynamic; all of this fundamentally changes the way the internals of Jackrabbit respond under load, and it's possible that this has been the root cause of all our problems. Still, we have them, and I will post later with some observations on that.

We were due to do a release at the end of September, but load testing (way too late) showed that we had major performance problems. To be fair, I had heard reports from other projects of similar problems, but arrogantly thought that much earlier load testing had shown we didn't have a problem; more fool me. We found that on read requests the server was essentially single threaded, never making use of extra cores. Below is a shot of the thread trace from YourKit. Red shows where threads are blocking and doing nothing; bear in mind that there are 200 threads and the screen only shows a few. All were blocking. No wonder it was fast with 1 thread but really slow with 200.

After a bit of investigation I found one SystemSession being shared by all request threads, blocking synchronously. Fortunately, with OSGi bundles it's easy to rebuild a jar with patched classes, so we patched that area to attach sessions to workspaces and threads within the Access Control Manager. Although that reduced the blocking, we also had to reduce memory consumption, since we now had 200 system sessions all building up state. So I added SoftReferences and positive eviction to ensure that no session became too old and bloated, and should JVM GC activity get too high it would forcibly evict bloated sessions. It turns out the standard Sun JVM GC doesn't look very far down soft-referenced object trees to see what it can evict, so it wasn't that good at evicting caches many hops removed from the soft reference, which is why we age the sessions based on use and lifespan. This does have an impact on performance, but it also reduces long-term memory growth and results in less overall GC activity. Coincidentally this also eliminated a deadlock I had been chasing for several weeks. A writer thread would take out an exclusive write lock in the SharedItemManager and, holding that lock, start to update Items. In that process the ACLProvider would need a read lock using its SystemSession; however, in the meantime another thread had entered the same system session via a synchronized block, excluding all other threads, and was waiting for a shared read lock in the SharedItemManager. Hey presto, deadlock. Pretty soon all reader threads queue up trying to get a shared read lock and the server stops responding. YourKit were kind enough to give me a license to use on open source, so I should say it would have been much harder to find that one without the thread trace and monitors in YourKit (thanks guys).
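
Stripped of the Jackrabbit detail, the session-ageing idea looks roughly like the sketch below. It is not the actual patch: the age cap and the small factory interface are assumptions, and in the real change one of these holders sits behind the per-workspace, per-thread binding described above.

import java.lang.ref.SoftReference;

// Sketch only: hold an expensive session behind a SoftReference and recreate it
// once it exceeds an assumed maximum age, so a bloated session's internal caches
// become collectable instead of growing without bound.
public class AgedSessionHolder<S> {

    public interface SessionFactory<T> {
        T create();
    }

    private static final long MAX_AGE_MS = 5 * 60 * 1000L; // assumed lifespan cap

    private final SessionFactory<S> factory;
    private SoftReference<S> ref;
    private long createdAt;

    public AgedSessionHolder(SessionFactory<S> factory) {
        this.factory = factory;
    }

    // Return a usable session, recreating it if it was collected or is too old.
    public synchronized S get() {
        S session = (ref == null) ? null : ref.get();
        long now = System.currentTimeMillis();
        if (session == null || now - createdAt > MAX_AGE_MS) {
            session = factory.create();          // fresh session, empty caches
            ref = new SoftReference<S>(session); // collectable under memory pressure
            createdAt = now;
        }
        return session;
    }
}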

That was fine: it reduced blocking on entry to the ItemManager bound to each session, but soon exposed blocking in the next layer down. Jackrabbit makes heavy use of LRUMaps from Commons Collections. These are non-thread-safe LRUMaps implementing an efficient LRU algorithm; absolutely fine if you really have single-threaded operations, but not so good if anything concurrent gets in. It turns out it's not too hard to re-implement an LRUMap concurrently (OK, I cheated and extended ConcurrentHashMap), so I set about replacing some of the problematic ones with thread-safe versions.
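
For reference, the simplest thread-safe LRU replacement looks like the sketch below. It is not the class that went into the patched bundle (that one extended ConcurrentHashMap), just the easiest version to get right: an access-ordered LinkedHashMap behind a synchronized wrapper, which keeps LRU ordering exact at the cost of some concurrency.

import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a size-bounded, thread-safe LRU map. All access is serialised by
// the synchronized wrapper, so it keeps strict LRU ordering but will not scale
// as well under heavy read concurrency as a ConcurrentHashMap-based version.
public final class SynchronizedLruMap {

    public static <K, V> Map<K, V> create(final int maxEntries) {
        return Collections.synchronizedMap(new LinkedHashMap<K, V>(maxEntries, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // evict the least recently used entry once the cap is exceeded
                return size() > maxEntries;
            }
        });
    }
}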

By this stage read concurrency was OK, with occasional waits for read locks in the SharedItemManager. There is a fix in Jackrabbit trunk for concurrency, due to be released in JR 2.2, but unfortunately there has been a lot going on between 2.1 and the head of trunk, and the stream of 30 or so patches made more of a mess than I was happy with, so we couldn't make further progress on read performance. The next worrying thing we noticed was that under high file upload to a content pool, the server was back in single-threaded mode. After investigation it turned out that the content pool structure we are using, which is a large key pool, requires that every item has its own set of ACLs. Content models in Jackrabbit are normally hierarchical; that's what it was built for. If every file has lots of ACLs, the invalidation traffic flowing through the Access Control Providers becomes dominant and, worse, the standard implementation shares the SystemSession reserved for request threads. This means that for every ACL modification, all threads on all requests get blocked. Removing that blockage by placing the modifications in a concurrent queue, for later removal by the request threads, appears to have eliminated all the blocking under load. Below is a trace from today. All our changes are in the server bundle of Nakamura; the next target is search performance, but I still have a nagging feeling that if we ignore David's model, we should not be using JCR for content storage.
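
The deferral trick, reduced to its essentials and assuming nothing about the real Jackrabbit classes, looks roughly like this: writers enqueue the invalidation cheaply, and each request thread drains the queue before it reads, so nobody blocks on anybody else.

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch only: a cache whose invalidations are queued by writers and applied
// lazily by readers, instead of being applied synchronously under a shared lock.
public class DeferredInvalidationCache<V> {

    private final Map<String, V> cache = new ConcurrentHashMap<String, V>();
    private final Queue<String> pending = new ConcurrentLinkedQueue<String>();

    // Writer threads: cheap and non-blocking.
    public void invalidate(String path) {
        pending.offer(path);
    }

    // Request threads: apply any queued invalidations, then read.
    public V get(String path) {
        String stale;
        while ((stale = pending.poll()) != null) {
            cache.remove(stale);
        }
        return cache.get(path);
    }

    public void put(String path, V value) {
        cache.put(path, value);
    }
}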





Performance and Releases

14 10 2010

Why does everyone do performance testing at the last minute? It must be because as a release date approaches the features pile in and there is no time to test whether they work, let alone perform. Sadly that's where we are with Nakamura at the moment. We have slipped our release several times now and are seeing a never-ending stream of bugs and performance problems. Some of these are just plain bugs, some are problems with single-threaded performance and some are problems with concurrency. Plain bugs are easy to cope with: fix them. This late in the cycle the fixes tend to be small in scope, because we did at least do some of the early things right; the other two areas are not so easy.

Native performance

We are using Jetty in NIO mode embedded inside OSGi, and see throughput of around 5K requests/s under concurrent load. Once we add in the first part of the Sling request processing chain this drops to around 1200/s, which is about the peak throughput we can get before we do any serious processing inside Sling. It's interesting to notice the impact of doing anything on throughput: every layer adds latency that hits throughput. Still, out of the box the stack is doing OK. An extensive benchmarking experiment at http://nichol.as/benchmark-of-python-web-servers shows that Java and Jetty, even in threaded mode, are just as good as some of the event-mode deployments of Python WSGI servers, and with a little work could be better. The Achilles heel of most Java applications is the ease with which developers add bloated complexity and soon find that their 5K requests/s drops to 100/s concurrent. Fortunately Jetty in OSGi shows all the signs of being concurrent. The same cannot be said for the rest of Nakamura.

Single Threaded Performance

In the rush to satisfy the feature stream we have relied too heavily on search. Search comes from Lucene within Jackrabbit and, when used correctly, it's fast, giving first results in sub-millisecond timeframes. However, we are using generic indexes, and some of the queries we have to deliver to the UI are not simple: they gather data from deep trees and attempt to perform joins throughout the content tree. With an infinite amount of time before the release we would not have done this and would have built custom indexes targeted precisely at the queries we need to support; however we didn't, and now we have a problem. Some queries on medium-sized deployments are down to 2s or more, single threaded. This is pretty awful when you think back to the targets set some time ago of sub-20ms for queries. Don't even ask about multithreaded. Now, we can probably fix these problems by doing some detailed indexing configuration deep within Jackrabbit.

Multi Threaded Performance

Disclaimer: Nakamura satisfies a slightly different use case from Sling and Jackrabbit, and so we have made modifications to some of the Jackrabbit code base. Nakamura supports content management where everyone can write content; Sling and Jackrabbit support content management where only a few can write content. That said, we have been seeing lots of issues surrounding synchronization blocking concurrent read operations deep within Jackrabbit. We have found that the default deployment of Jackrabbit contains a single SystemSession per workspace that is used for access control, and where most of the threads are capable of writing to the JCR, all threads block synchronously on the SystemSession supporting the access control manager. This doesn't happen in standard Jackrabbit, since it can declare the entire workspace read only and read granted, and bypass this bottleneck. Still, that makes the server single threaded and limits throughput to under 100 requests/s. To put that in context, Apache httpd + PHP on the same box won't even get near that figure, which is why Moodle (written in PHP) tends to be good for schools. However, that bit of good news doesn't help me.

I fixed the bottleneck problem, reduced the memory footprint and increased stability by binding the system sessions supporting access control operations to workspaces and then threads. Doing this multiplies the memory footprint, but I also evict these sessions on an age basis to prevent them accumulating the entire ACL state of the repository, so under load the server actually uses less memory than before. This is great, as it eliminates the synchronization bottleneck and fixes a rare deadlock condition we were seeing in Jackrabbit 2.1, where a writer would block on synchronization because a reader that held the synchronization was waiting for the read lock held by the writer. Unfortunately all this has done is expose the next layer of contention down the stack. In Jackrabbit, past the ItemManager which is owned by the session, there is a SharedItemManager shared by all sessions. Inside there are concurrent read locks and exclusive write locks. Under load we see these dominating, limiting the server to a throughput of around 600 requests/s. At this point we are waiting for the next release of Jackrabbit, 2.2, which looks like it might have addressed some of these problems. Since we see throughput drop with more than 10 threads, due to contention on those locks, we are limiting our servers from 200 threads down to 10, queuing up all requests in the Jetty acceptor where the impact is minimal. As soon as you do this on a search URL, though, all bets are off, since one of those can block everything for 2s, turning throughput from 600/s to 10/s.
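
However the Jetty pool is actually wired up in a given deployment, the shape of the change, expressed here with plain java.util.concurrent purely as an illustration rather than with Jetty's own API, is a small fixed worker pool with a cheap queue in front of it:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: ten workers ever touch the contended code; everything else waits
// cheaply in the queue rather than piling onto Jackrabbit's internal locks.
public class BoundedWorkers {

    public static ExecutorService create() {
        return new ThreadPoolExecutor(10, 10, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
    }
}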

Outcome

In real terms, with the poor search performance, that means we can probably only support 100 users per JVM, which isn't anywhere near enough. We can't deploy in a cluster, since that won't help sideways scalability much, although there are some patches that could be applied to the ClusterNode implementation to keep the journal sequential but not bound by a transaction lock in the DB. There are more things we could do, and am thinking about: replace the Jackrabbit RDBMS Persistence Manager with Cassandra; write a Jackrabbit SPI based on Cassandra (I think the SPI sits above the SharedItemManager); write a Sling Resource Provider using Cassandra, bypassing Jackrabbit altogether, though we would have to unbind from javax.jcr.*; or something else completely different.


Now, back to our Q1 release.