Sesame RDBMS Implementation

5 06 2006

It looks like there are some interesting features in the Sesame default RDBMS implementation. Since it uses its own connection pooling, it tends to commit on close. If the standard connection pool that is used by default is replaced by a java.sql.Datasource, things like commit don’t happen when Sesame thinks they should have happened. The net result is a bunch of exceptions associated with lock timeouts, as one connection coming out of the data source block subsequent connection. The solution looks like its going to be to re-implement most of the RDBMS layer with one that is aware of a Datasource rather than a connection pool.

Sesame in a Clustered environment

4 06 2006

Sesame has one major advantage in a clustered environment, it stores its content in a database. Im not saying this is good thing, but it just makes it easier to deploy in a clusterd environment where the only thing that is shared is the database. It should be relatively easy to make it work OOTB with Sakai… however, it looks like the default implementation of the Sesame RDBMS Sail driver (this is the RDF Repository abstraction layer) like to get a jdbc url, user name and password. This would be Ok, except that Sakai likes use a Data source.

The solution appears to be to extend various classes within the Sesame core rdbms implementation so that whenever a connection is required it comes from the Sakai data source rather than some separately managed JDBC pool.

Its not clear at the moment is Sesame is scalable enough to handle the potential number of triples that Sakai will generate. The tests of the Lucene part of the search engine were indexing about 5GB of data representing about 100,000 documents. Performance was perfectly acceptable. If the same document set was to put into a triple store, we will see at least 2M triples, and thats before we start to add in any work site ontology beyond the standard Sakai ontology.

If we get to this size of RDF store, we should also consider using Kowari but with an entirely native index format we might have to employ similar techniques to those used in the Lucene clustered search to make it work. Alternatively we could look at a dedicated RDF server… although I suspect that this would be too much deployment effort for most users.

Wiki Sub-Sites Groups and Sections

4 06 2006

In general the Wiki tool was well received, and the presentations done by Harriet, Andrew and Frances Tracy invoked thought. It was especially good to see faculty members relaying real teaching and research experience of Sakai in use.

However, we still have lots questions about how sections/groups are going to relate to Wiki sub-sites. The mapping idea, where a wiki sub-site maps to a section/realm group appears to make sense, but the UI for controlling the permissions and the way in which roles might be added to realms just isn’t clear enough yet. I hope to see this in in 2.2, but it might slip.

Exploding Content Hosting Service

4 06 2006

We had some extremely productive conversations on the ContentHostingService towards the end of the Vancouver Conference. The basic idea of the ContentHostingPulugin was extended, and it looks like it might be worth attempting to restructure the ContentHostingService to separate the implementation of the default storage mechanism so that node properties are stored centrally, but all handling mechanisms become plug ins. This will me merging both ContentResource and ContentCollection into a single ContentEntity more fully so that the core of ContentHosting can treat them the same, regardless of where and how they are stored. This opens the potential to have tools inject ContentHostingHandlers into the ContentHostingService. ie Repositories, DSpace, IMS-CP etc etc etc, not going to be 2.2, but maybe 2.3. Will be working with Jim Eng on this, as its his code.

Semantic Search

4 06 2006

Currently Search performs its indexing on text streams. There is a significant amount of information that can be extracted from entities, beside the simple digest of content. This includes things like the entity reference, the URL, title, description etc. There is also other information. We could create multiple indexes for this in Lucene quite easily, but it would not necessarily provide the search structure that is required. A better approach is probably going to be to represent this in RDF. So Im going to try and enhance the EntityContentProcuder with an RDF stream and place a pluggable RDF triple store underneath the search engine to operate as a secondary stream. Its quite possible that this will solve some of the search clustering problems and will certainly address the results clustering that would begin to make search really cool.

Vancouver 2006

4 06 2006

Vancouver 2006 was a watershed conference for me. The Sakai Foundation, less than 6 months old, transitioned from a funded project into and Open source community. Chuck was appointed as CEO, thankfully, as selecting almost anyone else would have created a year or more of pain for the community. Problems and issues were openly exposed and discussed without blame of recrimination. The future feels positive.