Wrong Content Model

17 10 2010

I knew there was something wrong. You know that gut feeling you have about something, when it just doesn’t feel right, but you can’t explain it coherently enough to persuade others, so eventually self-doubt creeps in and you go along with the crowd. We have a mantra, a borrowed phrase really, borrowed from JCR: “Content is everything”. It’s possible it was borrowed without knowing what it really meant. One of those emperor’s new clothes things; this time the emperor really was wearing clothes, and they were so fantastic that perhaps we thought they would fit our build, so we borrowed them, not quite realising what that meant.

One of the founding principles of Sakai 3 is that content should be shared everywhere. That expands to being re-used, repurposed and re-owned everywhere. To achieve that, there are two solutions. Give content ownership, and allow users who own the content to organise and manage it how they see fit, including adding their own structure to the content if that’s what they want to do. Alternatively, put all the content into a big pool and let users find it, tag it, apply ontologies to it and, crucially, express access rights at the individual item level. Neither approach is right, neither is wrong. The former has compromises when ownership is changed; the latter has compromises when each user needs to manage large volumes of content individually. I am not going to say which is better, having been burnt too many times trying to apply engineering logic to user experience design decisions, but one thing I do know is that the underlying implementation for one approach is very different from the underlying implementation for the other. Getting them the wrong way round is likely to lead to problems.

So the UX design process decided that the big pool approach was right. Often quoted in discussions was doing things like Google Docs: handing round pointers to documents, identified by a key at the end of each URL, opaque in meaning to the end user but immensely scalable; expressing access rights as “share this item with my friends”, “make this public”, or “I’m happy for anyone who has this URL to edit this”. In that there was no expression of where the document lived, no organisation of the document into containers and certainly no management of access on containers, although it is interesting to see that the limitation of managing large volumes of documents in a flat list has led Google Docs to introduce folders, where like-minded documents get shared with collaborators by virtue of their location. The content model is scalable on two levels. On a technical level, it’s easy to generate billions of non-conflicting keys per second, and it’s easy to shard and replicate the content associated with those keys on a global scale, migrating information to where it’s needed and avoiding the finite limitations of the speed of light and routers. On a human level, the machine-generated key and per-user hierarchy eliminate all conflicts. Google would be bust if it offered help desk resolution of conflicting document URLs where two users demanded to share the same web address for their document. By allocating Gmail usernames on a first come, first served basis, Google managed to get tacit acceptance of a given naming scheme without help desk load. How do you persuade Google to give you the userid of another user, just because you believe you have a right to it, when they got there first? Dream on.
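To make the pooled model concrete, here is a minimal sketch in Python. It is an illustration only, not how Google Docs or Sakai 3 is implemented; the `Pool` class, the sharing modes and the `/p/` URL prefix are all hypothetical names invented for the example.

```python
import uuid

# Hypothetical sharing modes, loosely modelled on the phrases quoted above.
PRIVATE, FRIENDS, ANYONE_WITH_URL, PUBLIC = "private", "friends", "anyone-with-url", "public"


class Pool:
    """A flat pool: items are addressed only by an opaque key, never by location."""

    def __init__(self):
        self.items = {}  # key -> {"owner", "body", "mode", "friends"}

    def add(self, owner, body, mode=PRIVATE, friends=()):
        # Random keys are generated without coordination, so nobody has to
        # negotiate over a name and collisions do not occur in practice.
        key = uuid.uuid4().hex
        self.items[key] = {"owner": owner, "body": body, "mode": mode, "friends": set(friends)}
        return key

    def can_read(self, key, user):
        # Access is decided per item; there is no container to consult.
        item = self.items[key]
        if user == item["owner"] or item["mode"] in (PUBLIC, ANYONE_WITH_URL):
            return True
        return item["mode"] == FRIENDS and user in item["friends"]


pool = Pool()
key = pool.add("alice", "draft syllabus", mode=FRIENDS, friends=["bob"])
print(f"/p/{key}")  # the URL identifies the item and says nothing about where it lives
```

Note that every access decision is stored on the item itself, which is exactly the property that becomes expensive later in this post.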

We chose a content technology because we wanted something that worked really well and covered the generalised use case of Content Management. The content system we have is hierarchical and maps URLs directly to content paths, exploiting hierarchical structure to make it easier to manage large volumes of data. It hits the sweet spot of Enterprise Content Management, but is it right for us? In Sakai 3, UX design has chosen a pooled content model for all the same use cases as covered by Google Docs, but we are building it on a content system that requires agreement of URL space, agreement of locations within the content system and, critically, uses that hierarchy to drive efficiencies on many levels. Hierarchy is fundamental to the Java Content Repository specification, fundamental to Jackrabbit’s implementation and to some extent a natural part of the HTTP specification. Attempting to layer a pooled content system over a fundamentally hierarchical storage system is probably a recipe for disaster on two levels. Technically it can be done, and we have done it, but as my gut tells me we are beginning to find out, it wasn’t a good idea. All those efficiencies that were core to the content repository model are gone, exposing some potentially quite expensive consequences. At a human level we have side-stepped the arguments over who owns what URL in a collaborative environment by obfuscating the URL, but in the process we have snatched back from the user the ability to organise their content by hierarchy. The help desk operations that support Sakai 3 won’t go bust processing URL conflict resolution, since users don’t get URLs to be conflicted over.
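One of the efficiencies that hierarchy buys can be sketched as follows. This is a deliberately simplified illustration, assuming a "nearest ancestor wins" rule; it is not Jackrabbit’s actual ACL evaluation, and the paths, user names and the `readers` helper are hypothetical.

```python
import posixpath

# Hypothetical ACL table for a hierarchical store: path -> set of readers.
tree_acls = {"/courses/101": {"alice", "bob"}}


def readers(path, acls):
    """Walk up from the item towards the root and return the nearest ACL found."""
    while True:
        if path in acls:
            return acls[path]
        if path == "/":
            return set()
        path = posixpath.dirname(path)


# One entry on the container covers every item stored beneath it.
docs = [f"/courses/101/week{n}/notes" for n in range(1, 11)]
covered = all(readers(d, tree_acls) == {"alice", "bob"} for d in docs)

# In a flat pool there is no container to inherit from, so the same
# sharing decision has to be recorded once per item.
pool_acls = {f"/p/{n:032x}": {"alice", "bob"} for n in range(10)}

print(covered, len(tree_acls), len(pool_acls))
```

The tree needs a single entry for ten documents, while the pool needs ten; scale that to the volumes discussed here and the cost of giving up the hierarchy becomes visible.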

What should we do? I think we should admit that the models are separate and not try to abuse one user experience with the wrong supporting implementation or, conversely, force an implementation to do what it was not designed for. We have a crossover and we have to choose. For want of better words, we have to choose: a Content Management User Experience supported by content management storage, living true to the “everything is content” embodied by David’s model, or a Pooled User Experience supported by UUID-based object storage intended for massive scale. Mixing the two is not an option.

Hindsight is a wonderful thing to learn from. Mistakes made? Yes: we have mixed up a choice on a technical level with political desires, forgetting that in a design-led development process technical decisions must be made purely to service the design. If the political desires were important, they should have adjusted the design process from the start. I live and learn.




6 responses

17 10 2010

Wading into an area that’s not my strength, so apologies if this is noise.

In terms of the basic first principles, if you’re designing a system with an exclusively pooled content approach, aren’t you effectively saying that content has no (public) context, aside from the implicit one: the base URI (which in this case would be the particular Sakai instance)?

So I guess that leads me to a question, and a comment.

The question might be a dumb one, but what happens if you want to share content across instances? E.g. what happens if you no longer assume a single pool? It’s an abstract question, but is driven by some concrete issues, like what happens if I want to create a course that’s shared in some form across instructors at different institutions (I’ve seen some really interesting examples of this recently)?

The comment is this: I guess I tend to think of the hierarchical URI, and by extension content model, as particularly suited to published content: me/courses/101/2010/fall/syllabus (the version of my 101 syllabus specific to a particular course instance), doej/blog/some-post (a user blog, which floats independent of any particular course), etc. E.g. it represents content that is deliberately placed in some context for some audience.

So I guess while I recognize there’s a lot of tricky technical and UI issues here, I have a hard time believing that simply adopting a pooled content approach actually solves them without introducing others.

In any case, would be great to see the trade-offs and ultimate decision articulated in the design docs and such.

18 10 2010

Yes, in the pooled scenario there is no context for each item other than the base URI, and since that is a pointer to the pool itself, the context is so wide it is effectively unusable. That pool context is also bound to the instance, which is normally the institution. Sharing across institutions, if the sharing is not completely public, creates a need for distributed identity (Shibboleth, OpenID, etc). Although the pool in itself has no context, the server design, yet to be adopted by the UI, is built on allowing users to build their own ontology of tags (the design doc) and then annotate content within the pool, exposing the ontologies as URL access points with narrower and wider terms building the paths. That allows reuse of the content in multiple contexts (very similar to what you were discussing in concrete issues). It generates the opportunity to apply access controls at levels within the ontologies, but also a massive headache where the pooled content model is implemented on an existing hierarchical store, as ultimately all the ACLs also need to be applied at the pooled item level. Naturally we could just re-write the entire access control system, but at that point I start to think that the model is wrong, and certainly most hierarchical systems that are asked to store billions of items in a single container can’t cope. Having said that, even global pools are hierarchical deep under the covers, in some index or other; the hierarchy is just not exposed to or imposed on the higher level content systems.
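A rough sketch of that idea, in Python, may help. It assumes each tag has at most one wider term and that a grant made at a tag must be copied to every pooled item beneath it; the tags, keys and function names are hypothetical and are not taken from the Sakai 3 design doc.

```python
# Hypothetical ontology: each tag points at its wider (broader) term.
wider = {"thermodynamics": "physics", "physics": "science", "science": None}

# Pooled items annotated with tags; the keys are opaque.
tagged = {"k1": {"thermodynamics"}, "k2": {"physics"}, "k3": {"thermodynamics"}}


def tag_path(tag):
    """Build a URL-style path from the widest term down to the given tag."""
    parts = []
    while tag is not None:
        parts.append(tag)
        tag = wider[tag]
    return "/" + "/".join(reversed(parts))


def items_under(tag):
    """Items tagged with the term itself or with any narrower term beneath it."""
    def within(t):
        while t is not None:
            if t == tag:
                return True
            t = wider[t]
        return False
    return sorted(k for k, tags in tagged.items() if any(within(t) for t in tags))


def grant(tag, user, item_acls):
    """Granting access at a tag fans out into one ACL write per pooled item."""
    keys = items_under(tag)
    for key in keys:
        item_acls.setdefault(key, set()).add(user)
    return len(keys)


item_acls = {}
writes = grant("physics", "bob", item_acls)
print(tag_path("thermodynamics"), writes)
```

The path gives the item a context without moving it, but a single grant at `physics` still costs one write per item, which is the headache described above.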

19 10 2010

Is distributed identity support (OpenID, etc.) on the roadmap, or at least relatively easy to add at some point?

19 10 2010

OpenID is there as a result of Sling. Shib and others are relatively easy to add as a separate bundle or by using a mod_auth_* module in Apache httpd.

17 01 2011
Vidar S. Ramdal

Did you ever consider writing a custom Jackrabbit AccessManager?

I feel that the ACL concept is not always the right thing, and in fact often makes sharing scenarios such as the ones you describe unnecessarily complicated.
I found this talk interesting in that respect: http://vimeo.com/2723800

17 01 2011

The AccessManager we have is based on a customised version of the DefaultAccessManager in Jackrabbit. It keeps the ACL concept and rep:policy but makes some extensions. We need ACLs, and where we do, the DefaultAccessManager conceptually fits our needs. The problem is twofold. As you mentioned, describing sharing in terms of ACLs overcomplicates, and if you have a lot of content with a lot of sharing then you have to dedicate more and more memory to the DefaultAccessManager to avoid it reloading ACLs. This is made even worse when the number of things that you want to share with, or the number of expected child items within the content system at any point, is greater than 1000 items; then you are forced to create intermediate paths. The simple 1:1 mapping between conceptual content object and repository node breaks, and every update impacts several intermediate nodes, creating more strain on the shared item caches, which see much more traffic. The sharing model that our UX designers are pursuing is much closer to a cloud of content than a hierarchical tree of content, which doesn’t appear to map well.
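The intermediate paths mentioned above can be sketched like this. It is a generic hash-sharding illustration under assumed parameters, not the scheme Sakai 3 actually uses; `sharded_path`, `nodes_touched`, the `/p` root and the choice of SHA-1 are all hypothetical.

```python
import hashlib


def sharded_path(key, levels=3, width=2):
    """Place an item beneath intermediate nodes derived from a hash of its key.

    With width=2 hex characters each intermediate level fans out to at most
    256 child nodes. How many items end up in a leaf shard depends on the
    total volume stored.
    """
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/" + "/".join(["p", *shards, key])


def nodes_touched(path):
    """Every node between the pool root and the item, including the item itself."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:n]) for n in range(2, len(parts) + 1)]


path = sharded_path("item-42")
print(path)
print(len(nodes_touched(path)))  # 3 intermediate nodes plus the item
```

One conceptual object now corresponds to a chain of nodes, so whether a given store actually reads or writes each of them on update, the caches have several times as many entries to hold per item.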
