Sakai Search Update

16 07 2006

I have been doing quite a bit of work recently improving the performance and operation of the search indexer in Sakai. The problems have mainly been around using Hibernate in long-running transactions. It is so unpredictable in how it touches the database that long-running transactions lock large areas of the database… which causes other threads, connected to user requests, to randomly exhibit stale object, optimistic locking and lock timeout failures. This is only made worse in MySQL, where the lock is placed on the primary key index, which means that neighboring records are also locked (if they don’t exist).

So the search index builder has now been running continuously inside Sakai for 48 hours, building a 4GB index.

There are still some outstanding issues to be aware of.

1. PDF documents that contain a lot of drawing instructions. I have some particularly bad examples of PDF documents that take several minutes to load in Preview or Acrobat because of the huge number of drawing instructions they contain. When search encounters one of these documents under extremely heavy load, I have seen it take 10 minutes to process the document, as it has to lay out all the drawing instructions. This is a rare occurrence, but occasionally it will make the search indexer believe that the indexer thread has died, and so it will remove the lock.

If this happens, the work done by the first thread will be lost, and a new thread will take over the work. The first thread will die gracefully, but it may cause a slowdown in the rate of indexing.

If for any reason a PDF document takes more than 15 minutes to read, it will block the indexing queue and you will have to remove that document or improve its encoding. I am thinking about ways of eliminating this problem.
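The lock takeover described above can be sketched as a lease that a worker must keep renewing. This is only a minimal illustration and not the Sakai search code: the class `IndexLease`, its method names and the 10 minute timeout used in `main` are assumptions made for the example.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical lease on the index queue: the owner must renew it or lose it. */
final class IndexLease {
    private final long timeoutMillis;
    private String owner;      // null when nobody holds the lease
    private long lastRenewed;  // time of the owner's last successful acquire or renew

    IndexLease(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Take the lease if it is free, already ours, or the owner has gone quiet too long. */
    synchronized boolean tryAcquire(String worker, long now) {
        boolean expired = owner != null && now - lastRenewed > timeoutMillis;
        if (owner == null || owner.equals(worker) || expired) {
            owner = worker;
            lastRenewed = now;
            return true;
        }
        return false;
    }

    /** Heartbeat. Returns false if another worker has taken over in the meantime. */
    synchronized boolean renew(String worker, long now) {
        if (!worker.equals(owner)) {
            return false;
        }
        lastRenewed = now;
        return true;
    }

    public static void main(String[] args) {
        IndexLease lease = new IndexLease(TimeUnit.MINUTES.toMillis(10));
        long start = 0L;
        System.out.println(lease.tryAcquire("thread-1", start));   // true
        // thread-1 is stuck in a slow PDF and sends no heartbeat for 11 minutes
        long later = start + TimeUnit.MINUTES.toMillis(11);
        System.out.println(lease.tryAcquire("thread-2", later));   // true, takeover
        System.out.println(lease.renew("thread-1", later));        // false, thread-1 must stop
    }
}
```

In this model the first thread discovers the takeover the next time it calls `renew`, which is the point where it would discard its work and exit.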

2. Over time the number of segments in the index increases. New segment builds are merged into the current segment. At the moment, old segments are not merged. This means that if the site is continuously rebuilt, the number of segments will increase and disk space will be consumed.

In addition, each update will cause DB space to be consumed in the form of redo logs, although only the current version of the segment is maintained.

At some point I need to write a class to perform merging of segments that can, as part of the index build, do a complete merge operation and reduce the number of segments.
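To show why a complete merge would reclaim space, here is a toy model in plain Java. It does not use Lucene and is not the planned merge class: each segment is modelled as a map from a document id to its indexed content, and `SegmentMergeSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model: a segment is a map of document id to indexed content. */
final class SegmentMergeSketch {

    /** Merge segments ordered oldest to newest; newer entries replace older ones. */
    static Map<String, String> mergeAll(List<Map<String, String>> segments) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> segment : segments) {
            merged.putAll(segment);
        }
        return merged;
    }

    /** Number of stored entries across all segments, stale copies included. */
    static int totalEntries(List<Map<String, String>> segments) {
        int total = 0;
        for (Map<String, String> segment : segments) {
            total += segment.size();
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, String> older = new LinkedHashMap<>();
        older.put("doc1", "first version");
        older.put("doc2", "first version");
        Map<String, String> newer = new LinkedHashMap<>();
        newer.put("doc1", "second version");

        List<Map<String, String>> segments = new ArrayList<>();
        segments.add(older);
        segments.add(newer);

        System.out.println(totalEntries(segments));     // entries held before the merge
        System.out.println(mergeAll(segments).size());  // entries left after the merge
    }
}
```

After the merge only the newest copy of each document survives, so a site that is rebuilt repeatedly stops accumulating stale copies.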


Clustered Index in Search

9 06 2006

After spending many hours with a hot CPU, I finally came up with an efficient mechanism for indexing in real time in a database clustered environment. I could have put the Lucene index segments directly into the database with a JDBCDirectory from the Compass Framework. But unfortunately the MySQL configuration of Sakai prohibited the emulation of seeks within BLOBs, so the performance was hopeless. I’m not convinced that emulating seeks in BLOBs actually helps, as I think the entire BLOB might still be streamed to the app server from the database.

Normally you would run Lucene using the Nutch Distributed File System, which borrows some concepts from the Google File System. NDFS is a self-healing, shared-nothing file system tuned for use with Lucene…. but it’s not easy to set up from within a Java app, and it has to have some nodes dedicated to certain tasks.

Failing that, you might run rsync at the end of each index cycle to sync the index onto the cluster nodes. I think this was the preferred method prior to NDFS. However, it’s a bit difficult to get to EXT3 inodes from within a Java app, and Sakai runs on Windows and Unix, so I can’t rely on the native rsync command.

The solution that has just gone out to the QA community is to use the local FSDirectory to manage local copies of the index segments and, once an index write cycle is complete, distribute the modified segments via the database. In testing, I tried this against MySQL with about 10GB of data in 200-300K documents. It worked OK. I’m waiting with bated breath to see how many JIRA items are posted against this, as everything that flows over the Sakai Entity bus is seen by the indexer. Nice to have a component that just gets tested whatever is done in QA!
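The distribution step can be pictured as each node comparing the versions of its local segments with the versions recorded in the database. The following is a minimal sketch of that comparison only. It assumes each segment carries a version number, which is my assumption for the example, and `SegmentSyncSketch` and `segmentsToFetch` are hypothetical names rather than classes in the search component.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Decide which segments a node should copy from the shared store. */
final class SegmentSyncSketch {

    /**
     * Returns, in sorted order, the names of segments that are missing from
     * "mine" or have a lower version there than in "theirs".
     */
    static List<String> segmentsToFetch(Map<String, Long> mine, Map<String, Long> theirs) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> entry : theirs.entrySet()) {
            Long myVersion = mine.get(entry.getKey());
            if (myVersion == null || myVersion < entry.getValue()) {
                stale.add(entry.getKey());
            }
        }
        Collections.sort(stale);
        return stale;
    }

    public static void main(String[] args) {
        Map<String, Long> local = new HashMap<>();
        local.put("segment-a", 1L);
        local.put("segment-b", 5L);

        Map<String, Long> shared = new HashMap<>();
        shared.put("segment-a", 2L);
        shared.put("segment-b", 5L);
        shared.put("segment-c", 1L);

        // Segments this node should pull from the database
        System.out.println(segmentsToFetch(local, shared));
        // Swapping the arguments gives the segments this node should publish
        System.out.println(segmentsToFetch(shared, local));
    }
}
```

The node that has just finished a write cycle would publish, and every other node would fetch before its next search.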

Section Group Support for Wiki in Sakai

9 06 2006

There is already some support for Groups and Sections in Sakai RWiki. This is basic support that connects a Wiki SubSpace to a Worksite group. If the connection is made (by using the name of the group as the SubSpace name), permissions are taken from the Group permissions. There is a wiki macro that will generate links to all the potential Group/Section SubSites in a Worksite (see the list of macros in the editing help page).

This is a simple approach that is probably understandable, but it’s not exactly sophisticated or flexible. So, being a glutton for UI punishment, we have started to open up the concept further.

The concept is that for any node in the Wiki hierarchy, that is Wiki Pages or Wiki Subsites, you (a maintain or admin user) can configure which permissions ‘realm’ is associated with the node, edit the permissions on the roles, add or delete roles in that ‘realm’, modify the permissions associated with a role, and add or remove users from a role.

A can of worms! The challenge is not in creating the functionality; anything is possible. The challenge is in creating a UI that doesn’t confuse the hell out of anyone other than the developer who created it.

One view on this is that it’s better to stick with simple statements that control the permissions and not expose the full power of the underlying permissions system. Such a statement might be ‘Lock this page’. I think I agree with that for access-type users, but for a user who is maintaining a worksite, this may not be enough power. I am going to have to do many mock-ups to uncover all the issues. The advanced permissions editing may not make 2.2.

Sesame RDBMS Drivers

6 06 2006

I’ve written a data source based Sesame driver, but one thing occurs to me in the Sakai environment: most production deployments do not allow the application servers to perform DDL operations on the database. Looking at the default implementations, that is the non data source ones, they all perform lots of DDL on the database at startup and in operation. This could be a problem for embedding Sesame inside Sakai. I think I am going to have to re-implement the schema generation model from scratch. It might even be worth using Hibernate to build the schema, although it is not going to make sense to use Hibernate to derive the object model; the queries are just too complex and optimized.

Sesame RDBMS Implementation

5 06 2006

It looks like there are some interesting features in the Sesame default RDBMS implementation. Since it uses its own connection pooling, it tends to commit on close. If the standard connection pool that is used by default is replaced by a javax.sql.DataSource, things like commit don’t happen when Sesame thinks they should have happened. The net result is a bunch of exceptions associated with lock timeouts, as one connection coming out of the data source blocks subsequent connections. The solution looks like it’s going to be to re-implement most of the RDBMS layer with one that is aware of a DataSource rather than a connection pool.
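One way to stop relying on commit-on-close is to make every unit of work commit or roll back explicitly before the connection goes back to the data source. The sketch below uses only standard JDBC interfaces. `DataSourceWork`, `SqlWork` and `inTransaction` are hypothetical names, and this is not how the Sesame RDBMS layer is structured; it only illustrates the idea.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

/** Runs a unit of work on a DataSource connection with an explicit commit. */
final class DataSourceWork {

    /** A piece of JDBC work that may fail with SQLException. */
    interface SqlWork {
        void run(Connection connection) throws SQLException;
    }

    static void inTransaction(DataSource dataSource, SqlWork work) throws SQLException {
        // try-with-resources hands the connection back even if the work fails
        try (Connection connection = dataSource.getConnection()) {
            boolean previousAutoCommit = connection.getAutoCommit();
            connection.setAutoCommit(false);
            try {
                work.run(connection);
                connection.commit();    // never depend on close() to commit
            } catch (SQLException | RuntimeException e) {
                connection.rollback();  // release any locks held by the failed work
                throw e;
            } finally {
                connection.setAutoCommit(previousAutoCommit);
            }
        }
    }
}
```

A caller would pass the container-managed data source and a lambda containing the statements, so no connection is ever returned with an open transaction still holding locks.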

Sesame in a Clustered environment

4 06 2006

Sesame has one major advantage in a clustered environment: it stores its content in a database. I’m not saying this is a good thing, but it just makes it easier to deploy in a clustered environment where the only thing that is shared is the database. It should be relatively easy to make it work OOTB with Sakai… however, it looks like the default implementation of the Sesame RDBMS Sail driver (this is the RDF Repository abstraction layer) likes to get a JDBC URL, user name and password. This would be OK, except that Sakai likes to use a data source.

The solution appears to be to extend various classes within the Sesame core rdbms implementation so that whenever a connection is required it comes from the Sakai data source rather than some separately managed JDBC pool.

It’s not clear at the moment if Sesame is scalable enough to handle the potential number of triples that Sakai will generate. The tests of the Lucene part of the search engine were indexing about 5GB of data representing about 100,000 documents. Performance was perfectly acceptable. If the same document set were put into a triple store, we would see at least 2M triples, and that’s before we start to add in any work site ontology beyond the standard Sakai ontology.

If we get to this size of RDF store, we should also consider using Kowari, but with its entirely native index format we might have to employ techniques similar to those used in the Lucene clustered search to make it work. Alternatively, we could look at a dedicated RDF server… although I suspect that this would be too much deployment effort for most users.

Wiki Sub-Sites Groups and Sections

4 06 2006

In general the Wiki tool was well received, and the presentations done by Harriet, Andrew and Frances Tracy provoked thought. It was especially good to see faculty members relaying real teaching and research experience of Sakai in use.

However, we still have lots of questions about how sections/groups are going to relate to Wiki sub-sites. The mapping idea, where a wiki sub-site maps to a section/realm group, appears to make sense, but the UI for controlling the permissions and the way in which roles might be added to realms just isn’t clear enough yet. I hope to see this in 2.2, but it might slip.