Jackrabbit PM in columns

14 05 2010

A while back I asked about Cassandra as a Persistence Manager for Jackrabbit 2. The problem with any column DB, and to some extent Cassandra (although it has a concept of quorum), is that the persisted value is only eventually consistent across the whole cluster. Jackrabbit PMs, at least in JR2 and earlier, need a level of ACID in what they do. The PM acts as a central transaction monitor, streaming updates to storage and maintaining an in-memory state. If you look at JR running against an RDBMS this is obvious: the ratio of select to modify is completely reversed, with 90% update and 10% read, in spite of the JR application experiencing 90% read and 10% update. This presents no problem on a single node, provided the PM gives some guarantees of ACID-like behaviour.

The problem comes when JR is run in a cluster. When the state of items is changed on one node, that change must be known on all nodes so that their internal caches can be adjusted. That's what the ClusterNode implementation in JR2 does. In a cluster it goes further: the sequence in which the changes are applied must be consistent, so that an add, remove, add happens in that order on all nodes. Finally, and crucially for a store that is only eventually consistent, all nodes must be able to see the committed version of an item when they receive the event concerning that item.
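To make that ordering requirement concrete, here is a rough sketch of what a cluster journal record has to carry. These are not Jackrabbit's actual classes; the names and fields are made up purely for illustration.

```java
// Illustrative sketch only -- not Jackrabbit's real journal classes.
// A cluster journal record needs a globally agreed revision number so every
// node replays changes in the same order, plus enough information for the
// receiving nodes to invalidate their local item caches.
import java.io.Serializable;
import java.util.List;

public class JournalRecord implements Serializable {
    private final long revision;               // globally ordered sequence number
    private final String producerNodeId;       // cluster node that made the change
    private final List<String> changedItemIds; // items whose cached state is now stale

    public JournalRecord(long revision, String producerNodeId, List<String> changedItemIds) {
        this.revision = revision;
        this.producerNodeId = producerNodeId;
        this.changedItemIds = changedItemIds;
    }

    public long getRevision()               { return revision; }
    public String getProducerNodeId()       { return producerNodeId; }
    public List<String> getChangedItemIds() { return changedItemIds; }
}
```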

That's why I asked about Cassandra. If you took the existing RDBMS ClusterNode implementation, which uses a DB table to ensure the sequence of events is the same on all nodes, then events would get replayed correctly, but when each node accessed its local connection to the column DB, it's almost certain that its local state would not be sufficiently consistent to provide the correct version of the data. I have ignored the slight problem that the journal managed by a central DB is still a central point of failure, a synchronization point for the entire cluster (with a reasonable amount of latency), and a bottleneck for throughput.
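For readers who haven't looked at that code, the general pattern is roughly the sketch below. This is not the actual Jackrabbit DatabaseJournal; the table and column names are invented for illustration, but it shows why every write on every node ends up funnelled through one central table, which is exactly what makes it the ordering point, the single point of failure and the bottleneck.

```java
// Rough sketch of the central-DB journal pattern, not Jackrabbit's actual code.
// GLOBAL_REVISION is assumed to hold a single row with the last revision issued;
// JOURNAL holds one row per change set, ordered by revision.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DbJournalSketch {

    /** Allocate the next global revision and append a journal entry, atomically. */
    public long append(Connection con, byte[] changeLog) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement lock = con.prepareStatement(
                 "SELECT REVISION FROM GLOBAL_REVISION FOR UPDATE");
             ResultSet rs = lock.executeQuery()) {
            rs.next();
            long next = rs.getLong(1) + 1;

            // Bump the single-row revision counter while we hold the row lock.
            try (PreparedStatement bump = con.prepareStatement(
                     "UPDATE GLOBAL_REVISION SET REVISION = ?")) {
                bump.setLong(1, next);
                bump.executeUpdate();
            }
            // Append the change set under the revision we just allocated.
            try (PreparedStatement ins = con.prepareStatement(
                     "INSERT INTO JOURNAL (REVISION, DATA) VALUES (?, ?)")) {
                ins.setLong(1, next);
                ins.setBytes(2, changeLog);
                ins.executeUpdate();
            }
            con.commit();
            return next;
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }
}
```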

So I did a bit of playing. It's not complete, but I think it's possible to have a ClusterNode implementation tightly coupled with a PM, where the ClusterNode uses JGroups to manage the journal sequence, with one elected master emitting sequence numbers and all slaves listening in, waiting to become the next master should the current one fail. The approach looks like it will happily produce sequence numbers several orders of magnitude faster than the central DB approach. The second part is to use the modified ClusterNode to communicate a version of the item in its event. Should the receiving node not be able to get that version of the item locally, it can come back to the originating node and ask it for a copy, avoiding the problems of eventual consistency and the penalty of a fully quorate column DB. Now, I haven't tested any of this other than the ClusterNode impl, so I don't know whether traffic back to the consistent node for the correct version of the item will dominate or not, but it all looks feasible.
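A minimal sketch of the election side of that idea is below. It is not the implementation described above: it assumes a JGroups 3.x-style API, the class and cluster names are made up, and the way slaves obtain numbers from the master (e.g. via an RpcDispatcher) is left out to keep it short. It uses the common JGroups convention that the first member of the current view acts as coordinator, so when the master fails the next view promotes someone else automatically. The second half of the idea, shipping the item version in the event and fetching from the producing node when the local column store hasn't converged, would sit on top of the same channel.

```java
// Sketch only: master election and sequence emission over JGroups.
// Assumes a JGroups 3.x-style API; names are invented for illustration.
import org.jgroups.Address;
import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

import java.util.concurrent.atomic.AtomicLong;

public class JGroupsSequencerSketch extends ReceiverAdapter {

    private final JChannel channel;
    private final AtomicLong nextRevision = new AtomicLong();
    private volatile boolean master;

    public JGroupsSequencerSketch(long lastKnownRevision) throws Exception {
        this.nextRevision.set(lastKnownRevision);
        this.channel = new JChannel();       // default protocol stack
        this.channel.setReceiver(this);
        this.channel.connect("jr-cluster");  // hypothetical cluster name
    }

    /** JGroups installs a new view on every membership change, including master failure. */
    @Override
    public void viewAccepted(View view) {
        Address coordinator = view.getMembers().get(0);
        // First member of the view takes over as master and emits sequence numbers;
        // everyone else just listens, ready to be promoted by the next view.
        master = coordinator.equals(channel.getAddress());
    }

    /** Called on the master to allocate the next journal revision. Slaves would
     *  forward the request to the coordinator; that RPC is omitted here. */
    public long allocateRevision() {
        if (!master) {
            throw new IllegalStateException("only the elected master emits sequence numbers");
        }
        return nextRevision.incrementAndGet();
    }

    public void close() {
        channel.close();
    }
}
```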

Having said all that, the discussions on JR3, which looks like making it into trunk any moment now, may make all of this obsolete, as there was talk of using an append-only node hierarchy store, similar to that used by Git, CouchDB and others. This might still need to overcome the consistency issues when distributed, but it would be core rather than a bolt-on. Need to go and look…

4 responses

15 05 2010
Vincent Olivier

So how do we help on JR3? I'm currently posting on my blog the various projects where I would like such a thing. Here is the first one, though maybe the least interesting in the short term (other details coming soon about the other ones): http://up4.com/g

I just registered on the mailing list…

15 05 2010
Ian

Probably the best way is to check out the Jackrabbit trunk with the JR3 parts in and kick the tyres. The dev@jackrabbit.a.o list, although a working list, is open to analysis and discussion based on evidence. Search for messages with jr3 in the subject to get a feel for it. BTW, I haven't had a chance to see where the code is going yet.

15 05 2010
Vincent Olivier

Hi Ian,

The reason I'm saying that we should try JR3 right away is that, even if I didn't look at ClusterNode, I think all the clustering work should be delegated to the persistence engine. Is that why you say it should be a "ClusterNode implementation tightly coupled with a PM"? In the sense that it would be a very thin ClusterNode impl? Wouldn't it be better if, to JR, the storage was just one big service with one entry point that took care of everything? And then JR would just be logic in the web layer on top of the storage layer, which could also be clustered, for auth and session stuff, I guess? Is that where JR3 is going?

Or maybe I'm wrong and we should just go for JR2 right away, ripping out the logic in the CN & PM that is not tolerant of an eventually-consistent approach?

15 05 2010
Vincent Olivier

And also, you should integrate a comment system like Disqus on your blog, so I can easily keep track of this conversation! 😀



