GroupEdit and RWiki

28 07 2006

Looks like there are others with the same interest in offline collaborative document writing. If you took Sakai + RWiki and exposed its services as web services, it would be a short step to integrating a full-blown offline collaborative document writing environment.

You would need a client, but taking the RWiki render engine and embedding it as an Eclipse application would give quite a nice cross-platform environment. It would not be WYSIWYG, but for this type of work WYSIWYG sometimes gets in the way. Just look at Wikipedia for positive proof of what happens when you sacrifice WYSIWYG for collaboration.

Just a thought.

ContentHosting JSR-170

27 07 2006

For those who don't know, JSR-170 is the Java Content Repository specification. Apache has recently released a 1.0 implementation, Jackrabbit, which looks good. Day Software, who formed a large part of that project, are using something similar in the commercial products that they sell. If you believe their web site (no reason not to), they have some solid names as customers.

For those who don’t know, Sakai has a Content Hosting Service that uses a proprietary implementation, which has survived and been improved since the Chef days.

Having done a bit of work in Content Hosting already, producing a plugin mechanism as a patch in Sakai 2.1.2 and 2.2, I feel that it has some shortcomings. These observations are not a criticism, as it does what it does very well. But I feel that there are some aspects of the implementation that get in the way. For instance, Collections and Resources are separate objects. The storage of objects or nodes within the content hierarchy is done in such a way as to make extension difficult. I also have a feeling that the data access patterns are causing performance problems with WebDAV.

Rather than whine about it to the community, I’ve decided to have a go at using a JCR under the ContentHostingService to re-implement this service. If it works it might make life easier.

Jackrabbit performance looks good, and the storage architecture matches what is implemented in Sakai, so it should work in a cluster. The intention is to use the Jackrabbit WebDAV and inject the Sakai security model at the service level. The JCR will become akin to an RDBMS, where a back-end user acts on behalf of a role in the front end.

Search Deployments 2

27 07 2006

The second problem that has been found in Search in production is that the number of index segments grows.

As the index is updated, new search documents are added to new segments that are merged into the current segment. When the current segment reaches a certain size, it is retired and a new segment is started. Retired segments are still active; they just don’t have any new documents added to them. This is relatively normal behavior. Nutch does something similar, with more sophistication.

The problem at one production site was that they had a reasonable number of documents and had rebuilt the entire index several times. Hence they created a fairly large number of segments. Eventually the JVM said ‘too many files open’.

It turns out that a small bit of code that was ensuring the SearchIndex reload was atomic was leaving file handles in memory, waiting for the GC to close them. So I fixed that stupid mistake. But it also made me think about what happens to a huge index. I’ve done tests with 200 segments with no problems, but as the corpus grows there must come a point where there are just too many files.
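To illustrate the kind of fix involved, here is a minimal sketch of an atomic reload that closes the old handle explicitly instead of leaving it for the GC. This is not the actual Sakai code: the class name ReloadableIndex and its methods are hypothetical, and it assumes the index reader is something that implements java.io.Closeable.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for the currently published index reader.
final class ReloadableIndex<T extends Closeable> {
    private final AtomicReference<T> current = new AtomicReference<>();

    // Returns the reader that searches should use right now (null before the first reload).
    T get() {
        return current.get();
    }

    // Atomically publishes the fresh reader, then closes the one it replaced,
    // so its file handles are released immediately rather than at GC time.
    void reload(T fresh) {
        T old = current.getAndSet(fresh);
        if (old != null) {
            try {
                old.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

The sketch ignores searches that are still using the old reader when it is closed. A real implementation would need something like reference counting before closing.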

So I also implemented a segment merging algorithm. This merges segments together by order of magnitude. At the end of each search operation, the index builder (which has a cluster-wide lock) sorts the segments by size and, based on the first segment, tries to merge 10 other segments within the same order of magnitude. It then looks for the next order of magnitude and tries to do the same. It ignores the current segment, as we want that one to be small and fast.

This works well. It creates a log base 10 structure to the segments, greatly reducing the number of segments present. The downside is that when a merge happens it is potentially a large event. Imagine when the largest segments are in the 1G range: a 10G merge might take a bit of time, even if the infrastructure can support > 2G files.

At the end of that, bar some minor UI bugs, search is up in production at at least one non-Cambridge site. Cambridge are naturally going to be deploying Search on all 3 of their sites.
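The merge pass described above can be sketched roughly as follows. This is an illustration only, not the Sakai implementation: the class SegmentMergePlanner is hypothetical, segments are represented just by their size in bytes, the current segment is assumed to have been left out of the input already, and a batch is capped at 10 segments, which is my reading of the description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical planner: decides which retired segments to merge in one pass.
final class SegmentMergePlanner {
    // Assumed cap on how many segments go into a single merge.
    static final int MAX_MERGE = 10;

    // Order of magnitude of a size, i.e. floor(log10(size)); 0 for sizes below 10.
    static int magnitude(long sizeBytes) {
        int m = 0;
        while (sizeBytes >= 10) {
            sizeBytes /= 10;
            m++;
        }
        return m;
    }

    // Sorts the retired segment sizes and returns at most one merge batch
    // per order of magnitude. Batches of a single segment are dropped,
    // since there is nothing to merge.
    static List<List<Long>> plan(List<Long> retiredSizes) {
        List<Long> sorted = new ArrayList<>(retiredSizes);
        Collections.sort(sorted);
        List<List<Long>> batches = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            int mag = magnitude(sorted.get(i));
            List<Long> batch = new ArrayList<>();
            while (i < sorted.size() && magnitude(sorted.get(i)) == mag
                    && batch.size() < MAX_MERGE) {
                batch.add(sorted.get(i));
                i++;
            }
            // Leave any remaining segments of this magnitude for a later pass.
            while (i < sorted.size() && magnitude(sorted.get(i)) == mag) {
                i++;
            }
            if (batch.size() > 1) {
                batches.add(batch);
            }
        }
        return batches;
    }
}
```

Because each merged batch lands in a higher order of magnitude, repeated passes tend towards a handful of segments per power of ten, which is where the log base 10 structure comes from.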

Search Deployments 1

27 07 2006

It looks like there are a number of sites actively deploying Sakai Search in production. Thankfully the stream of requests for fixes has been relatively low volume, so I did something right.

There were 2 main areas where production teams found issues. Firstly, it's clear that Hibernate/Spring is next to hopeless when it comes to long-running transactions in multiple threads managed by Spring transaction proxies. It's not that Hibernate and Spring are bad, in fact they are excellent, but they are not so good away from the request cycle.

What has been observed is that if there is a background thread, such as a queue processor, contending with a foreground request thread, the queue adder, then the transaction policy on the background thread has to be tightly controlled to ensure that the number of rows it holds locks over is minimized. Without this, either thread may wait for a lock timeout. That is OK for the background thread, but bad for the request thread.

If you manage to do that, you will still get some lock contention in a high-load environment, so what happens when you do? The application should behave in a sane and predictable way. And this is where Hibernate starts to come unstuck. Since you don’t have precise control over what statements are issued to the database and when, you quickly find that the batch update of Hibernate is mildly uncontrollable, and so you spend a lot of time trying to predict and recover from these failures safely.

So I recoded the queue processor to use pure JDBC, and found control was regained. No more failures, I know exactly which records are locked, and I can recover sensibly. I still use Hibernate to build the schema and in other areas. In fact, having started out with a Hibernate implementation, recoding in JDBC was easy since there was a good structure there.

Sakai Search Update

16 07 2006

I have been doing quite a bit of work recently improving the performance and operation of the search indexer in Sakai. Problems have mainly been around using Hibernate in long-running transactions. It is so unpredictable in how it touches the database that long-running transactions will lock large areas of the database, which causes other threads, connected to user requests, to randomly exhibit stale object, optimistic locking and lock timeout failures. This is only made worse in MySQL, where the lock is placed on the primary key index, which means that neighboring records are also locked (if they don’t exist).

So the search index builder has now been running continuously inside Sakai for 48 hours, building a 4GB index.

There are still some issues outstanding to be aware of.

1. PDF documents that contain a lot of drawing instructions. I have some particularly bad examples of PDF documents that take several minutes to load in Preview or Acrobat. This is due to a huge number of drawing instructions in the PDF. When search encounters one of these documents under extremely heavy load, I have seen it take 10 minutes to process the document, as it has to lay out all the drawing instructions. This is a rare occurrence, but occasionally it will make the search indexer believe that the indexer thread has died, and so it will remove the lock.

If this happens, the work done by the first thread will be lost, and a new thread will take over the work. The first thread will die gracefully, but it may cause a slowdown in the rate of indexing.

If for any reason a PDF document takes > 15 minutes to read, then it will block the indexing queue and you will have to remove that document or improve its encoding. I am thinking about ways of eliminating this problem.

2. Over time the number of segments in the indexer increases. New segment builds are merged into the current segment. At the moment, old segments are not merged. This means that if the site is continuously rebuilt, the number of segments will increase and disk space will be consumed.

In addition each update will cause DB space to be consumed in the form of redo logs although only the current version of the segment is maintained.

At some point I need to write a class to perform merging of segments that can, as part of the index build, do a complete merge operation and reduce the number of segments.