Tagged Search

10 11 2006

New feature for search: a tagged-word cloud built from the terms in a set of search results.

1. Take all the term vectors from the Documents in a search result.
2. Merge the term vectors.
3. Take the top 100 terms.
4. Sort alphabetically.
5. Format the word sizes with CSS, using a normalized frequency.
6. Output the results.
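The steps above can be sketched roughly as follows. This is only an illustration in plain Python, not the actual search code: each term vector is assumed to be a simple dict of term to frequency, and the names `build_tag_cloud`, `render_html` and the 80%-200% font-size range are made up for the example.

```python
from collections import Counter
from html import escape


def build_tag_cloud(term_vectors, top_n=100, min_pct=80, max_pct=200):
    """Merge per-document term frequencies into (term, font-size %) pairs."""
    merged = Counter()
    for vector in term_vectors:           # steps 1 and 2: collect and merge
        merged.update(vector)
    top = merged.most_common(top_n)       # step 3: keep the most frequent terms
    if not top:
        return []
    counts = [count for _, count in top]
    lo, hi = min(counts), max(counts)
    span = (hi - lo) or 1                 # avoid dividing by zero if all counts match
    cloud = []
    for term, count in sorted(top):       # step 4: alphabetical order
        weight = (count - lo) / span      # step 5: normalise frequency to 0..1
        cloud.append((term, round(min_pct + weight * (max_pct - min_pct))))
    return cloud


def render_html(cloud):
    """Step 6: emit one span per term, sized through an inline CSS font-size."""
    return " ".join(
        f'<span style="font-size:{size}%">{escape(term)}</span>'
        for term, size in cloud
    )
```

Calling `render_html(build_tag_cloud(vectors))` on the merged vectors gives the markup for the cloud. The terms that come out are still the stemmed forms, so this does nothing for the odd-looking words.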

The only problem that I can see at the moment is that the stemming algorithm (under which look and looking are equivalent terms) means that terms can look a little odd. There probably isn't a way to un-stem. 😦

Jackrabbit Cluster

10 11 2006

Jackrabbit now has a clustered version… or at least there is a version in trunk, due to become part of core, that clusters. Jackrabbit can be made to use a shared DB quite easily, but if you do that and have 2 or more nodes accessing it, each node's local cache ends up holding a stale copy of the data. The cache is what makes Jackrabbit fast. You could throw away the cache, or make it write-through with expiry, but that would kill much of the performance.

So the clustered Jackrabbit uses a file-based journal to record the changes; nodes reading that file can then invalidate their local caches based on the shared file. This has two drawbacks: firstly it's slow, as polling has to be used, and secondly it's not scalable, as only one node can write at a time.

I don't know if it's possible, but I would hope we could replace this filesystem journal with ActiveMQ and a JMS topic. All JCR nodes could subscribe to the topic and invalidate their local caches based on JMS messages.
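To make the idea concrete, here is a toy sketch of topic-based cache invalidation. It is not Jackrabbit or ActiveMQ code: `Topic` is an in-memory stand-in for a JMS topic, the shared DB is modelled as a dict, and every class and method name here is hypothetical.

```python
class Topic:
    """In-memory stand-in for a JMS topic: every subscriber sees every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


class ClusterNode:
    """One repository node with a local cache in front of a shared store."""

    def __init__(self, name, shared_store, topic):
        self.name = name
        self.store = shared_store    # hypothetical shared DB, modelled as a dict
        self.cache = {}
        self.topic = topic
        topic.subscribe(self.on_message)

    def read(self, item_id):
        # Serve from the local cache; fall back to the shared store on a miss.
        if item_id not in self.cache:
            self.cache[item_id] = self.store[item_id]
        return self.cache[item_id]

    def write(self, item_id, value):
        # Write through to the shared store, then tell the other nodes.
        self.store[item_id] = value
        self.cache[item_id] = value
        self.topic.publish((self.name, item_id))

    def on_message(self, message):
        # Drop the stale entry unless this node made the change itself.
        sender, item_id = message
        if sender != self.name:
            self.cache.pop(item_id, None)
```

With two nodes on the same store and topic, a write on one node evicts the entry from the other node's cache, so its next read goes back to the shared store. Nothing has to poll a file, and any node can publish.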

Clustered MySQL

10 11 2006

If you buy fast disks, put them in a cluster and put a DB on top of them, then you might hope the clustering mechanism would not kill all that performance you were after. With 300MB/s disks and a 1000BaseT interconnect, DRBD generates a 3x slowdown with MySQL on anything that needs disk block writes. It would probably be OK with slow disks, but when you find that ORDER BY generates disk writes, and most of Sakai's selects have ORDER BY, then you're onto a non-starter. So we have abandoned DRBD for our HA cluster.
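A rough sanity check on those numbers, as a back-of-the-envelope sketch only. It assumes every mirrored write has to cross the interconnect before it completes, and it ignores protocol overhead and latency, so it is not a measurement of DRBD itself. The function names are made up for the example.

```python
def link_ceiling_mb_per_s(link_mbit_per_s):
    """Raw link bandwidth in MB/s (8 bits per byte, no protocol overhead)."""
    return link_mbit_per_s / 8


def best_case_slowdown(disk_mb_per_s, link_mbit_per_s):
    """Slowdown on sustained writes if the slower of disk and link sets the pace."""
    effective = min(disk_mb_per_s, link_ceiling_mb_per_s(link_mbit_per_s))
    return disk_mb_per_s / effective


print(link_ceiling_mb_per_s(1000))      # 1000BaseT tops out at 125 MB/s raw
print(best_case_slowdown(300, 1000))    # roughly 2.4x before any other overhead
```

On that assumption the gigabit link alone would account for a slowdown of about 2.4x on 300MB/s disks, which is at least in the same region as the 3x observed.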

We did look at a SAN based DB, but decided that it didn’t satisfy some of the other requirements, including a replication slave to take hot backups from and a real redundant spare.

However, I did get MySQL replication with failover and failback working in a cluster. Failover is fast enough (

MySQL Cluster

3 11 2006

We have been having fun with a MySQL cluster. We got the Linux-HA bit up without too many real problems and the Apache failover works fine. But clustering MySQL is not that easy. Master/slave replication works and will fail over, but because of the log offsets it is quite hard work getting it to fail back correctly. It certainly can't do the merry ping-pong that you can do with Apache.

We could use a SAN, but that would be defeatist. So DRBD, a network mirror. You might think it would be too slow, but having done some Bonnie++ tests on it, it gives close to native speed for reads. And it mirrors to the pair quite well… provided you have a 1000BaseT switch. I am getting sync rates on the network RAID of about 80-90MB/s, which is OK. The nice thing about it is that you can put all your software on the network RAID, so if you are not on the primary node you won't be able to start up the software and damage the setup, since standby nodes don't have their disks mounted… so this looks like it gives the failover benefit of a SAN with normal network hardware and no iSCSI.

When we did iSCSI tests a year ago, Bonnie++ was extremely successful at destroying filesystems. It appears to be really good at exposing weaknesses.