Jackrabbit, Oak, Sling, maybe even OAE

30 08 2012

Back in January 2010 the Jackrabbit team started asking its community what it wanted to see in the next version of Jackrabbit. There were some themes in the responses: high(er) levels of write concurrency, millions of child nodes, and cloud-scale clustering with NoSQL backends. I voted for all of these.

I was reminded of this activity by the announcement of an Oak Hackathon, or rather an Oakathon, being organised this September at the .adaptTo conference in Berlin. The Oakathon seems to be intended to get users up to speed on using Oak, which suggests it might be ready for users to take a look. So I did. The code checks out, builds and passes all its integration tests. No surprises there from the Jackrabbit team.

I am not going to pretend I understand the storage model being used or how it addresses the requirements that came out of the Jackrabbit community, but the persistence implementation looks like it could be adapted to a sharded schema over many DB instances, or even on top of Cassandra. The storage model looks something like a git tree. It seems to solve the many-child-nodes issue that Sparse Map Content solved for OAE in a slightly different, but more efficient, way: a DAG structure with pointers to a child tree rather than a parent pointer. I won't be able to tell, without some testing, whether it avoids the concurrency issues that forced me to squeeze Sparse Map Content into the Sling repository API layers, but the approach looks like it might. Certainly the aims of the roadmap cover the needs of OAE and go beyond the scale and concurrency required.
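To make the git-tree comparison concrete, here is a toy sketch of that kind of storage model: immutable nodes addressed by the hash of their content, with parents pointing to children so the structure forms a DAG. An update rewrites only the path from the changed node up to the root, and every untouched subtree is shared between revisions. This is purely an illustration of the general idea, not Oak's actual persistence format.

```python
import hashlib
import json

store = {}  # hash -> serialized node; stands in for the persistence layer


def put_node(properties, children):
    """Store an immutable node; children maps child name -> child hash."""
    blob = json.dumps({"props": properties, "children": children}, sort_keys=True)
    key = hashlib.sha1(blob.encode()).hexdigest()
    store[key] = blob
    return key


def get_node(key):
    return json.loads(store[key])


def set_property(root, path, name, value):
    """Return the hash of a new root with a property set at path.

    Only the nodes on the path are rewritten; siblings are shared.
    """
    node = get_node(root)
    if not path:
        props = dict(node["props"], **{name: value})
        return put_node(props, node["children"])
    head, rest = path[0], path[1:]
    new_child = set_property(node["children"][head], rest, name, value)
    children = dict(node["children"], **{head: new_child})
    return put_node(node["props"], children)


# Build a small tree: / -> {a, b}, a -> {a1}
a1 = put_node({"title": "leaf"}, {})
a = put_node({}, {"a1": a1})
b = put_node({}, {})
root1 = put_node({}, {"a": a, "b": b})

# Change a property deep in the tree; only /, /a and /a/a1 are rewritten.
root2 = set_property(root1, ["a", "a1"], "title", "renamed")

# The /b subtree is shared unchanged between the two revisions.
assert get_node(root1)["children"]["b"] == get_node(root2)["children"]["b"]
assert get_node(root1)["children"]["a"] != get_node(root2)["children"]["a"]
```

Because a parent lists its children rather than each child pointing at its parent, adding a child touches one node's child map instead of contending on a shared parent pointer, which is roughly how this shape helps with the many-child-nodes problem.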

Best of all, it already has a Sling Repository implemented, so it should be relatively easy to spin up Sling on Oak and run all the tests that caused OAE to move from Sling on Jackrabbit to Sling on a hybrid of SMC and Jackrabbit.

Follow Up: ORM

13 06 2012

My last post was about the good, the bad and the ugly of ORM. Having just surprised myself at the ease with which it was possible to build a reasonably complex query in Django ORM, I thought it would be worth sharing an example. The query is a common problem: I want a list of users, and their profile information, who are members of a group or one of its subgroups. I have several tables: Profile, Principal, Group, GroupGroups and GroupMembers. The ORM expression is, roughly (the original snippet was truncated; the lookup paths below are reconstructed from the generated SQL, using django.db.models.Q, and the exact names depend on the model definitions)

q = Profile.objects.filter(
        Q(user__principal__groupmember__group__name=name) |
        Q(user__principal__groupmember__group__groupgroups__group__name=name)
    ).order_by('search')[:25]

And the resulting SQL, which looks frightening but appears to scale well on a populated DB, is:

SELECT * FROM "user_profile" 
    INNER JOIN "auth_user" 
       ON ("user_profile"."user_id" = "auth_user"."id") 
    INNER JOIN "permission_principal" 
       ON ("auth_user"."id" = "permission_principal"."user_id") 
    INNER JOIN "user_groupmember" 
       ON ("permission_principal"."id" = "user_groupmember"."principal_id") 
    INNER JOIN "auth_group" 
       ON ("user_groupmember"."group_id" = "auth_group"."id") 
    LEFT OUTER JOIN "user_groupgroups" 
       ON ("auth_group"."id" = "user_groupgroups"."child_id") 
    LEFT OUTER JOIN "auth_group" T7 
       ON ("user_groupgroups"."group_id" = T7."id") 
    WHERE ("auth_group"."name" = ? OR T7."name" = ? ) 
    ORDER BY "user_profile"."search" ASC LIMIT 25

Job done: one query for the feed, paged and sorted, in 7 ms on a populated DB.

Monitoring SparseMap

24 02 2012

Yesterday I started to add monitoring to SparseMap to show what impact client operations were having and to give users some feedback on why, or if, certain actions were slow. There is now a StatsService, with an implementation StatsServiceImpl, that collects counts and timings for storage layer calls and API calls. Storage layer calls are generally expensive: if SQL storage is used, each one means a network operation; if a column DB storage layer is used, it may mean a network operation, but it is certainly more expensive than not making the call at all. Storage calls also indicate a failure in the caching.

The StatsServiceImpl can also be configured in OSGi to report the usage of individual Sparse sessions when the session is logged out, normally at the end of a request. So the long-term application stats should tell a user of SparseMap why certain areas are slow and which storage operations need attention, and, if using SQL, give the DBA an instant indication of where the SQL needs to be tuned. The short-term session stats should enable users of SparseMap to tune their usage to avoid thousands of storage layer calls and eliminate unnecessary API calls. Currently output is to the log file; soon there will be some JMX beans and maybe some form of streamed stats output. The log output is as follows.
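The bookkeeping involved is simple enough to sketch: count and time every storage operation, keyed as columnFamily:operation, and keep a separate tally of calls that exceed a slow threshold. The class and method names below are illustrative only, not the actual StatsServiceImpl API (which is Java/OSGi).

```python
import time
from collections import defaultdict


class Stats:
    """Toy per-operation counter/timer in the spirit of the stats service."""

    SLOW_MS = 50  # assumed threshold for recording an operation as slow

    def __init__(self):
        self.ops = defaultdict(lambda: [0.0, 0])   # key -> [total ms, count]
        self.slow = defaultdict(lambda: [0.0, 0])  # key -> [total ms, count]

    def record(self, key, elapsed_ms):
        entry = self.ops[key]
        entry[0] += elapsed_ms
        entry[1] += 1
        if elapsed_ms > self.SLOW_MS:
            s = self.slow[key]
            s[0] += elapsed_ms
            s[1] += 1

    def timed(self, key, fn, *args):
        """Run fn, recording its wall-clock time under key."""
        start = time.monotonic()
        try:
            return fn(*args)
        finally:
            self.record(key, (time.monotonic() - start) * 1000.0)

    def report(self):
        # Mimics the CSV layout of the log output shown below.
        print("app Key Storage, T, N, OP")
        for key, (total, n) in self.ops.items():
            print("app Storage, %.0f, %d, %s" % (total, n, key))


stats = Stats()
stats.timed("cn:select", lambda: time.sleep(0.001))  # fast call, not slow
stats.record("ac:select", 121.0)                     # simulated slow call
stats.report()
```

Aggregating totals and counts per key, rather than keeping every sample, is what makes it cheap enough to leave on in production and dump at session logout.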

app Key Slow Queries, T, N, Q 
app Slow Queries, 55, 1, cn:index_update:update cn_css_w set created = ? where rid = ?
app Slow Queries, 121, 1, ac:select:select b from ac_css_b where rid = ?
app Slow Queries, 272, 3, cn:select:select b from cn_css_b where rid = ?
app Slow Queries, 212, 3, cn:update:update cn_css_b set b = ? where rid = ?
app, Key Counters, ncalls, login, loginFailed, logout, slowOp, op, api
app, Counters, 2000, 2000, 0, 2000, 0, 4005, 0
app Key Storage, T, N, OP
app Storage, 2529, 2057, cn:select
app Storage, 1241, 1023, cn:index_update
app Storage, 888, 1004, cn:update
app Storage, 11, 18, cn:insert
app Storage, 21, 32, ac:select
app Storage, 10, 6, au:select
app Storage, 4224, 4001, n/a:validate
app Key Api Call, T, N, M
app Api Call, 11699, 999, org.sakaiproject.nakamura.lite.content.ContentManagerImpl.update
app Api Call, 217, 995, org.sakaiproject.nakamura.lite.accesscontrol.AuthenticatorImpl.signContentToken
app Api Call, 11, 1, org.sakaiproject.nakamura.lite.content.ContentManagerImpl.getInputStream
app Api Call, 691, 6061, org.sakaiproject.nakamura.lite.accesscontrol.AccessControlManagerImpl.signContentToken
app Api Call, 122, 2, org.sakaiproject.nakamura.lite.content.ContentManagerImpl.writeBody
app Api Call, 1250, 5059, org.sakaiproject.nakamura.lite.content.ContentManagerImpl.get
Output is in CSV form so the logs can be processed later. The column headings have the following meaning:

T   total time in ms
N   number of calls
M   method
OP  operation
Q   query

OP is formatted as columnFamily:operation, so cn:select means a select on the Content columnFamily.

For the DBA the Slow Queries are the ones to watch.

In this instance, at some point during the lifetime of this JVM

select b from ac_css_b where rid = ?

took 121 ms to run, but it happened only once. (The laptop hung while performing a Time Machine backup.)

This feature will be in version 1.6.