Sakai Search Update

16 07 2006

I have been doing quite a bit of work recently improving the performance and operation of the search indexer in Sakai. The problems have mainly been around using Hibernate in long running transactions. Hibernate is unpredictable in how it touches the database, so a long running transaction can lock large areas of the database. That causes other threads, which are serving user requests, to fail randomly with stale object, optimistic locking and lock timeout errors. This is made worse in MySQL, where the lock is placed on the primary key index, which means that neighboring records are also locked, even if they don’t exist yet.

So the search index builder has now been running continuously inside Sakai for 48 hours, building a 4GB index.

There are still some outstanding issues to be aware of.

1. PDF documents that contain a lot of drawing instructions. I have some particularly bad examples of PDF documents that take several minutes to load in Preview or Acrobat because of the huge number of drawing instructions they contain. When search encounters one of these documents under extremely heavy load, I have seen it take 10 minutes to process the document, as it has to lay out all the drawing instructions. This is a rare occurrence, but it will occasionally make the search indexer believe that the indexer thread has died, and so it will remove the lock.

If this happens, the work done by the first thread will be lost and a new thread will take over the work. The first thread will die gracefully, but it may cause a slowdown in the rate of indexing.

If for any reason a PDF document takes more than 15 minutes to read, it will block the indexing queue and you will have to remove that document or improve its encoding. I am thinking about ways of eliminating this problem.
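One possible direction, sketched here under assumptions rather than taken from the Sakai code, is to run each document digest with a time limit and skip the document when the limit is exceeded, so that one slow PDF cannot hold the queue. The class name BoundedDigester and its method are hypothetical, and the sketch uses only the standard java.util.concurrent classes.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Sakai: runs a content digest with a time limit. */
final class BoundedDigester implements AutoCloseable {

    // Daemon threads, so a stuck parser cannot keep the JVM alive on shutdown.
    private final ExecutorService pool = Executors.newCachedThreadPool(runnable -> {
        Thread thread = new Thread(runnable, "digest-worker");
        thread.setDaemon(true);
        return thread;
    });

    /** Returns the extracted text, or empty if the extractor failed or overran the limit. */
    Optional<String> digest(Callable<String> extractor, long timeout, TimeUnit unit) {
        Future<String> future = pool.submit(extractor);
        try {
            return Optional.ofNullable(future.get(timeout, unit));
        } catch (TimeoutException e) {
            // Interrupts the worker; a parser that ignores interrupts will keep running.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the extractor itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Optional.empty();
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

The caller would treat an empty result as "index the metadata only, or skip" and move on to the next item. The limitation is stated in the comment: cancelling only helps if the PDF parser responds to thread interruption, which I have not verified for any particular parser.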

2. Over time the number of segments in the indexer increases. New segment builds are merged into the current segment. At the moment, old segments are not merged. This means that if the site is continuously rebuilt, the number of segments will increase and disk space will be consumed.

In addition, each update will cause DB space to be consumed in the form of redo logs, although only the current version of the segment is maintained.

At some point I need to write a class to perform merging of segments that can, as part of the index build, do a complete merge operation and reduce the number of segments.

Clustered Index in Search

9 06 2006

After spending many hours with a hot CPU, I finally came up with an efficient mechanism for indexing in real time in a database clustered environment. I could have put the Lucene index segments directly into the database with a JDBCDirectory from the Compass Framework. Unfortunately, the MySQL configuration of Sakai prohibited the emulation of seeks within BLOBs, so the performance was hopeless. I’m not convinced that emulating seeks in BLOBs actually helps, as I think the entire BLOB might still be streamed to the app server from the database.

Normally you would run Lucene using the Nutch Distributed File System (NDFS), which borrows some concepts from the Google File System. NDFS is a self healing, shared nothing file system tuned for use with Lucene, but it’s not easy to set up from within a Java app, and it needs some nodes dedicated to certain tasks.

Failing that, you might run rsync at the end of each index cycle to sync the index onto the cluster nodes. I think this was the preferred method prior to NDFS. However, it’s a bit difficult to get to EXT3 inodes from within a Java app, and Sakai runs on Windows and Unix, so I can’t rely on the native rsync command.

The solution that has just gone out to the QA community is to use the local FSDirectory to manage local copies of the index segments and, once an index write cycle is complete, distribute the modified segments via the database. In testing, I tried this against MySQL with about 10GB of data in 200,000-300,000 documents. It worked OK. I’m waiting with bated breath to see how many JIRA items are posted against this, as everything that flows over the Sakai Entity bus is seen by the indexer. Nice to have a component that just gets tested whatever is done in QA!
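The bookkeeping behind this can be illustrated with a small sketch. It assumes, purely for illustration, that every segment has a name and a monotonically increasing version number, held both on local disk and in the shared database. The class SegmentSync and the version scheme are hypothetical and are not the actual Sakai implementation.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch: works out which index segments need copying between two stores. */
final class SegmentSync {

    /**
     * Returns the names of segments that are missing from target, or older there than in source.
     *
     * @param source segment name -> version in the store being copied from
     * @param target segment name -> version in the store being copied to
     */
    static SortedSet<String> newerIn(Map<String, Long> source, Map<String, Long> target) {
        SortedSet<String> names = new TreeSet<>();
        for (Map.Entry<String, Long> entry : source.entrySet()) {
            Long other = target.get(entry.getKey());
            if (other == null || other < entry.getValue()) {
                names.add(entry.getKey());
            }
        }
        return names;
    }
}
```

Under these assumptions, the node that has just finished a write cycle would push newerIn(local, shared) to the database, and every other node would fetch newerIn(shared, local) into its local FSDirectory before reopening the index.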

Sesame RDBMS Drivers

6 06 2006

I’ve written a data source based Sesame driver, but one thing occurs to me in the Sakai environment: most production deployments do not allow the application servers to perform DDL operations on the database. The default implementations, that is the non data source ones, all perform lots of DDL on the database at startup and in operation. This could be a problem for embedding Sesame inside Sakai. I think I am going to have to re-implement the schema generation model from scratch. It might even be worth using Hibernate to build the schema, although it is not going to make sense to use Hibernate to derive the object model, as the queries are just too complex and optimized.

Sesame RDBMS Implementation

5 06 2006

It looks like there are some interesting features in the Sesame default RDBMS implementation. Since it uses its own connection pooling, it tends to commit on close. If the standard connection pool that is used by default is replaced by a javax.sql.DataSource, things like commit don’t happen when Sesame thinks they should have happened. The net result is a bunch of exceptions associated with lock timeouts, as one connection coming out of the data source blocks subsequent connections. The solution looks like it’s going to be to re-implement most of the RDBMS layer with one that is aware of a DataSource rather than a connection pool.
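As a rough illustration of what "aware of a DataSource" could mean, here is a minimal sketch that borrows a connection, commits explicitly and rolls back on failure, instead of relying on close to commit. The helper class DataSourceTransactions and its SqlWork interface are hypothetical names, not Sesame or Sakai API. Only the standard java.sql and javax.sql interfaces are used.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

/** Hypothetical helper, not Sesame or Sakai API: one unit of work per borrowed connection. */
final class DataSourceTransactions {

    /** A unit of database work that may throw SQLException. */
    interface SqlWork {
        void run(Connection connection) throws SQLException;
    }

    static void inTransaction(DataSource dataSource, SqlWork work) throws SQLException {
        // try-with-resources hands the connection back to the pool even on failure.
        try (Connection connection = dataSource.getConnection()) {
            connection.setAutoCommit(false);
            try {
                work.run(connection);
                connection.commit();   // commit explicitly; do not rely on close() to do it
            } catch (SQLException | RuntimeException e) {
                connection.rollback(); // release any locks taken by the failed work
                throw e;
            }
        }
    }
}
```

Keeping each unit of work inside one short commit-or-rollback block means that no connection goes back to the pool still holding locks, which is the failure described above.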

Sesame in a Clustered environment

4 06 2006

Sesame has one major advantage in a clustered environment: it stores its content in a database. I’m not saying this is a good thing, but it does make it easier to deploy in a clustered environment where the only thing that is shared is the database. It should be relatively easy to make it work OOTB with Sakai. However, it looks like the default implementation of the Sesame RDBMS Sail driver (this is the RDF repository abstraction layer) likes to get a JDBC URL, user name and password. This would be OK, except that Sakai likes to use a data source.

The solution appears to be to extend various classes within the Sesame core rdbms implementation so that whenever a connection is required it comes from the Sakai data source rather than some separately managed JDBC pool.

It’s not clear at the moment whether Sesame is scalable enough to handle the potential number of triples that Sakai will generate. The tests of the Lucene part of the search engine were indexing about 5GB of data representing about 100,000 documents. Performance was perfectly acceptable. If the same document set were put into a triple store, we would see at least 2M triples, and that’s before we start to add in any work site ontology beyond the standard Sakai ontology.

If we get to this size of RDF store, we should also consider using Kowari, but with its entirely native index format we might have to employ similar techniques to those used in the Lucene clustered search to make it work. Alternatively, we could look at a dedicated RDF server, although I suspect that this would be too much deployment effort for most users.

Semantic Search

4 06 2006

Currently Search performs its indexing on text streams. There is a significant amount of information that can be extracted from entities besides the simple digest of content. This includes things like the entity reference, the URL, title and description, and there is other information as well. We could create multiple indexes for this in Lucene quite easily, but that would not necessarily provide the search structure that is required. A better approach is probably going to be to represent this in RDF. So I’m going to try to enhance the EntityContentProducer with an RDF stream and place a pluggable RDF triple store underneath the search engine to operate as a secondary stream. It’s quite possible that this will solve some of the search clustering problems, and it will certainly address the results clustering that would begin to make search really cool.
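To make the idea of an RDF stream concrete, here is a minimal sketch of turning the metadata listed above into triples, written out in N-Triples form. Everything here is an assumption made for illustration: the class EntityTriples is hypothetical, the entity URL is used as the subject, and the Dublin Core element vocabulary is just one plausible choice of predicates.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: turns entity metadata into subject-predicate-object triples. */
final class EntityTriples {

    // Dublin Core element namespace; the choice of vocabulary is an assumption.
    static final String DC = "http://purl.org/dc/elements/1.1/";

    /** Formats one statement with a literal object as an N-Triples line. */
    static String triple(String subject, String predicate, String literal) {
        String escaped = literal
                .replace("\\", "\\\\") // escape backslashes first
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
        return "<" + subject + "> <" + predicate + "> \"" + escaped + "\" .";
    }

    /** Emits one triple each for the entity reference, title and description. */
    static List<String> describe(String reference, String url, String title, String description) {
        List<String> triples = new ArrayList<>();
        triples.add(triple(url, DC + "identifier", reference));
        triples.add(triple(url, DC + "title", title));
        triples.add(triple(url, DC + "description", description));
        return triples;
    }
}
```

A content producer could hand a list like this to whichever triple store is plugged in, alongside the text stream it already supplies to Lucene.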