Triple Stores, Marsupials and Communities

24 10 2006

What happens when a corporate tries to buy and exploit open source without the community? The likelihood is that it dies; open source is the community. When Northrop Grumman bought Tucana it acquired the copyright to parts of the Kowari code base, which was available under an MPL license. It then sent letters to some of the developers preventing them from releasing version 1.1, claiming it would damage their business. Presumably to avoid damages the key developers left, and the community entered a log jam. The kowari-dev list makes interesting reading, especially the lack of messages since May 2006, the last one being from a lawyer. Unfortunately I suspect that the kowari-general list would have made interesting reading too… but that has disappeared along with all the old messages.

So what did the developers do?

They forked the code from before the disputed period, and re-did all the development since then in a clean room. They also changed the license to OSL v3, which is sticky, to prevent the code becoming closed source.

The community looks active, willing to help and friendly. When I asked whether Sakai could use Mulgara under the current Sakai license and still allow commercial affiliates to have closed-source components, the answer was “that’s what we intended, and let us know if you want help”.

The full response is

and the positive nature of the community is expressed in

Kowari and Mulgara are both Oz marsupials. Open source is the community, and commercial is what pays the bills… all is right in the world.

Hosting Darwin

20 10 2006

A few months ago, Caret offered hosting space to Darwin Online, a Cambridge University project. We didn’t write any of the code; that work was done by Antranig Basman, who works at Caret part time.

This Wednesday we found out that the BBC was interested in interviewing the project lead, which they did. On Thursday morning the site went live. The BBC broadcast items on Breakfast Radio and TV news and a number of other channels, and within 4 hours we had seen 2M hits on the site. At the end of the day I think we saw 4M hits, and today, Friday, I think there have been another 4M hits. This probably isn’t that big a site, but we were expecting about 500,000 hits in the first day, so there was quite a bit of on-the-fly load testing as it ramped up. Credit must go to Daniel Parry and Sultan Kus for tuning the deployment on the fly and monitoring the infrastructure, and to Antranig for developing something that didn’t crash in the first few hours.

What we learnt… 80% of the visits started from people typing in the URL, not from search engines or referrers; 10% were from Slashdot… traditional media channels are alive and well, at least in the UK.

We should have load tested a bit harder and deployed on hardware matched to the load… and we should have predicted the load better as well.

It’s much easier to load an application up with a few 100,000 users all wanting to get to the information than it is to use a load tester… everyone has a few 100,000 willing users, don’t they?

ActiveMQ / Kowari / Sakai Events

20 10 2006

For some time, I’ve been thinking about how the Sakai events, which can fill up a production database, should be managed. Although of interest, the events are not necessarily needed in the database for the smooth running of Sakai, and when there are 10–20M events present, the event service slows down a little on insert.

So, I’ve been playing with ActiveMQ and JMS. I’ve put together a prototype JMS adapter that channels Sakai events into a local JMS broker, which then, via a bridge, propagates the events to a hub broker. This hub broker, of which there would be one or more per Sakai cluster, takes the feeds of messages and further distributes them.
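As a sketch of what the adapter’s serialization step might look like (the tab-separated format and the field names here are my own invention for illustration, not the actual Sakai Event API or the prototype’s wire format), an event can be flattened into a text payload suitable for a JMS TextMessage:

```java
// Sketch: flatten an event into a text payload for a JMS TextMessage.
// The fields and separator are illustrative assumptions, not Sakai's API.
public class EventPayload {

    /** Tab-separated payload: timestamp, event type, resource ref, user id. */
    public static String encode(long timestamp, String event,
                                String resource, String userId) {
        // Tabs inside field values would corrupt the payload, so strip them.
        return timestamp + "\t" + clean(event) + "\t"
                + clean(resource) + "\t" + clean(userId);
    }

    /** Split a payload back into its four fields. */
    public static String[] decode(String payload) {
        return payload.split("\t", -1);
    }

    private static String clean(String s) {
        return s == null ? "" : s.replace('\t', ' ');
    }
}
```

A consumer on the hub side can then `decode` the TextMessage body without needing any Sakai classes on its classpath, which is part of the point of the loose binding.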

One such listener on the hub might be a JMS-to-RDF converter that takes the JMS serialization of the Sakai event, converts it into triples and pushes it into a Kowari instance.
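A minimal sketch of what such a converter could emit, as N-Triples ready to load into a Kowari/Mulgara model. The namespace and predicate names are invented for illustration; a real mapping would define a proper ontology:

```java
// Sketch: turn one event into N-Triples text.  The namespace and
// predicate names below are made up for illustration only.
public class EventToRdf {

    private static final String NS = "http://example.org/sakai/event#";

    public static String toTriples(String eventId, String eventType,
                                   String resource, String userId) {
        String subject = "<" + NS + eventId + ">";
        StringBuilder nt = new StringBuilder();
        nt.append(subject).append(" <").append(NS).append("type> ")
          .append(literal(eventType)).append(" .\n");
        nt.append(subject).append(" <").append(NS).append("resource> ")
          .append(literal(resource)).append(" .\n");
        nt.append(subject).append(" <").append(NS).append("user> ")
          .append(literal(userId)).append(" .\n");
        return nt.toString();
    }

    /** Minimal N-Triples literal escaping: backslash and double quote. */
    private static String literal(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```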

Another use of the hub could be AJAX-based real-time monitoring of the event feed for NOC-type operations.

A side effect of this loose binding is that JMS consumers can exist as loosely bound components, and so can use GPL jars without causing problems for the main Sakai code base. But then they can’t be distributed as part of Sakai… so it’s all a bit moot.

Segment Merge

10 10 2006

The segment merge algorithm in search is dumb and needs to be made better. At the moment it has a habit of not merging up to the full 2G segment size. This has the advantage that we don’t ask for massive transfers, but it would be better to be able to ask for a target segment size and actually get it.
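One way to get closer to a requested target would be to greedily pack the smallest segments first and stop before overshooting. This is just a sketch of that idea, not the actual Sakai search merge code; sizes are in bytes and the names are invented:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pick a set of segments whose combined size approaches, but
// does not exceed, a target merged-segment size.
public class MergePlanner {

    /** Returns the segment sizes chosen for one merge, aiming at targetSize. */
    public static List<Long> planMerge(List<Long> segmentSizes, long targetSize) {
        List<Long> sorted = new ArrayList<>(segmentSizes);
        sorted.sort(null); // smallest first: cheap merges, best packing
        List<Long> chosen = new ArrayList<>();
        long total = 0;
        for (long size : sorted) {
            if (total + size > targetSize) {
                break; // adding this segment would overshoot the target
            }
            chosen.add(size);
            total += size;
        }
        return chosen;
    }
}
```

Merging smallest-first also keeps individual merge transfers modest, which matters when the shared segment store is on network storage.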

Structured Storage of Segments

10 10 2006

When the size of the index gets big, some problems appear that I thought wouldn’t. A 500G index of 1G segments is going to have at least 500 files in the local segments space and in the shared segments space, which at this size I would hope is on the file system.

500 files in a directory might be OK in ext3 on local disk, but on an AFS/NFS/SMB NAS file system it’s likely to cause problems.

The solution: hash the file system, just as is done with your local Firefox web cache and all sorts of other systems. In the latest version of search from trunk there is a first-level hash that will limit the number of files in the base directory to 100. This can be turned on for both the shared segment store and the local segment store, but it must be the same on all nodes. When the search service starts up it automatically reconfigures the store to match the segmentation scheme that is being used.
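The first-level hash can be sketched like this, with the bucket count of 100 described above. The naming scheme is illustrative only, not the actual Sakai search implementation:

```java
// Sketch: spread segment files over at most 100 sub-directories so no
// single directory holds hundreds of entries.
public class SegmentPath {

    private static final int BUCKETS = 100;

    /** Relative path for a segment, e.g. "37/segment-0001". */
    public static String pathFor(String segmentName) {
        // hashCode() can be negative, and Math.abs(Integer.MIN_VALUE) is
        // still negative, so mask to a non-negative value first.
        int bucket = (segmentName.hashCode() & 0x7fffffff) % BUCKETS;
        return String.format("%02d/%s", bucket, segmentName);
    }
}
```

Because the bucket is derived from the segment name alone, every node computes the same path, which is why the scheme has to be identical across the cluster.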

Search Hardware Requirements

10 10 2006

The hardware requirements of search are somewhat undefined… why? Because we are dealing with a variety of document types with all sorts of content. A 10M PDF might contain only 10K of indexable content, and a 100K email message might contain 99K of indexable content; this makes it difficult to come up with anything precise about the size of the index.

I have recently moved to not storing the digested content in the index, to reduce its size. Since the index records are now just offsets into a terms vector, the compression is far greater than before.

Sakai search, being based on Lucene, is quite similar to Nutch, which informed parts of its indexing operation, so perhaps some of the metrics from the Nutch community are valid. They indicate that 100M documents will require 1TB of index space. A single node handling 1 search query per second can cope with 20M documents, and at 20 search queries per second per node, the node can handle 2M documents with 4G of RAM. They must be thinking of 64-bit architectures. I suspect that this level of performance is an order of magnitude greater than required by Sakai, but if we look at the space requirements we might get 10G per 1M documents. Obviously the mix of documents is different from a typical search engine load. In versions up to now, we haven’t reached that target; on an earlier version with uncompressed full content in the index we were only getting 50% compression on the original size… the later trunk code should improve this, as I think it has already done for Cape Town.
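The Nutch figure works out at roughly 10G of index per 1M documents, which makes back-of-envelope sizing easy. The document counts in the comments below are made-up examples, not measurements:

```java
// Back-of-envelope index sizing from the Nutch metric of 1TB per 100M
// documents, i.e. roughly 10G of index per 1M documents.
public class IndexSizing {

    static final double GB_PER_MILLION_DOCS = 10.0; // from the Nutch metric

    /** Estimated index size in gigabytes for a given document count. */
    public static double estimateGb(long documents) {
        return documents / 1_000_000.0 * GB_PER_MILLION_DOCS;
    }

    public static void main(String[] args) {
        System.out.println(estimateGb(2_000_000));   // a hypothetical 2M-document install
        System.out.println(estimateGb(100_000_000)); // the Nutch-scale 100M-document case
    }
}
```

As the post says, the Sakai document mix is quite different from a search-engine crawl, so treat this as an upper-end guide rather than a prediction.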