Sakai CLE ElasticSearch

11 10 2012

A long time ago, I wrote a search module for Sakai 2 as CLE was known then. It attempted to make every node in a CLE instance share the load of indexing and searching and make the search aspect of a CLE cluster scale elastically. To some extents it worked, but it had problems. The indexing queue was persisted in a DB table and it was based on a old version of Lucene that didn’t have anything as useful as commits. Consequently it could get its segments into a bit of mess at times. The world has moved on in the 5 years since I wrote that code, and two viable alternatives for supporting Search in Sakai CLE have emerged. Apache Solr and Elastic Search. Both can be run as remote servers or embedded. Both are solid reliable releases. It could be argued that Solr has more support for sophisticated index schema, and it’s probably true that Elastic Search is easier to deploy for elastic scaling and real time indexing as that’s its default behaviour.

For those wanting to try Sakai CLE with Apache Solr as the search server then look no further than the work that Adam Marshall has been doing at Oxford University. That allows you to spin up a Solr instance and connect your Sakai CLE instances to it. You will have to do some reasonably sophisticated master slave configuration to make it resilient to failures and don’t expect the indexing operations to be real-time. There are plenty of references to the work required to do that in this blog, and arguments why I currently prefer ElasticSearch over Solr.

Deployment and reliability

ElasticSearch comes out the box being real-time, elastic and cloud aware, with built-in AWS EC2 knowledge as well as rack awareness. Its been built to shard, partition and replicate indexes out of the box. The ElasticSearch client as I am finding out is simple to embed into most environments including OSGi and when embedded makes each app server node a part of elastic search cluster. Best of all, for the nervous by nature, is the resilience that comes from spinning up more than 3 instances in the same cluster. In fact, I have been finding it hard to damage elastic search indexes in tests. It’s perfectly possible to do all of this with Solr, but the deployer has to work a little harder adding some custom components to support a writeahead log and a Zookeeper instance to manage the cloud.

Metadata Indexing

Probably the best part of ElasticSearch is the client which is a fully multithreaded client following the same pattern Communicating Sequential Processes first described by Tony Hoare and one of the motivators for the Go language. This allows a client for submit suitably light weight indexing requests to the ElasticSearch cluster via an embedded client without needing to think about managing a queue or the latency of indexing. This nice little feature turns the 1000 lines of code I had to write for Sakai CLE  and OAE search into about 20. Initial tests show that indexing can be done within the request loop and because of the true real-time nature ElasticSearch with its write ahead log, results are available about 50ms after the transaction commits. To maintain that latency, I only index metadata via this route. Document indexing takes a different route.

Document Indexing

I found with the original Sakai 2 search and subsequent Solr based indexing of documents in Sakai OAE that indexing bodies was expensive. In some instances tokenizing office documents could place extreme strain on a JVM heap. For that reason when I did the indexing service in the Django version of OAE I did two things. I offloaded the document body indexing operations to separate processors driven by a queue of events, following the CSP pattern mentioned above, and I made the content store single instance. Where users collaborate, they often upload the same document. With a single instance content store, only a single instance of a document is stored and hence, tokenizing and information extraction is only performed once. This greatly reduces the cost of indexing. The store isn’t collision perfect but by performing a hash on the document body as its saved its possible to eliminate most if not all collisions. Certainly SHA1(ing) enough of the body eliminates all collisions.

So the document indexing processes use the index to locate documents that need to be indexed and then use the single instance content store to eliminate duplicate tokenizing. Using this approach in the Sparse Content Map content system which is already single instance has a dramatic impact on IO. Sakai CLE Content Hosting Service is not single instance at present but could be adjusted to be so once hashes are known. It would be nice to fix that aspect of CHS at some point.

Current state

I am still working on this code, and this post is part notes, part notification should I get distracted. My testbed is the Sparse Content Map content system only because it builds in 20s, starts in 5, has full integration test coverage and compliant webdav support thanks to Milton. There is currently nothing in the code base that prevents it using Spring or a Webapp container as opposed to OSGi, and the coupling is loose being event driven. The best part is the result should scale as far as ES can scale which is probably a lot larger than any CLE instance in production.

Fibonacci ring for Cassandra

10 10 2012
King Protea (Protea cynaroides)

King Protea (Protea cynaroides) (Photo credit: Wikipedia)

No this isn’t a greek tragedy or some software that I have written, but a thought about the way in which Apache Cassandra an other distributed systems perform problem space decomposition. Cassandra is a good example of a distributed system with problem space decomposition. Its problem space is keys. To be efficient it needs to distribute those keys evenly around its cluster. The key partitioning algorithm normally uses something that generates a flat even distribution. A Linear Congruential Generator  could be used if you are prepared to live with some banding in the problem space. If not and you are prepared to live with a bit more computational expense one of the hash functions like MD5 or SHAx. In fact the standard key distribution functions in Cassandra use something based on MD5, which to my naive mind must have some collisions.

In reading the Cassandra documentation and using it some years back I became concerned about how elastic Cassandra is. The decomposition of Cassandra’s key domain is often represented as a ring. That ring is constructed when the cluster is creates and elements are allocated via the key-> ring function, I think they are called partitioners. From reading the documentation, partitioning of this space if fixed and static. If more nodes need to be added to a Cassandra cluster then the partitioning scheme must be updated and data must be migrated from existing nodes in the cluster to their new home before the cluster can become full active again. I think I got that right. That means, although you can replace nodes, you can’t elastically scale without partitioning work. I am not absolutely clear if that means the re-partitioning work can be done on a live system, or not. I would hope it can.

That got me thinking. There are other systems that repartition effectively during operation. Algebraic Multigrids used to solve high Reynolds number Eulerian grids repartition to accelerate the solution phase. I wrote a parallel AMG solver to run on Cray T3Ds in 1995. It was fast, efficient with good conversion rates  but struggled to beat the Cray vectorised versions of the code base on reasonable sized clusters. There is another. A plant. A plant doesn’t shutdown when it adds petals to its flower or leaves to its stem it keeps running (so to speak, I havent seen a running flower since University). The plants domain space that its partitioning is sunlight. As it adds leaves doesn’t add leaves as a whole ring, but it adds them one by one to make the most use of the available sunlight without shading other spaces. It doesn’t require that the cells from one leaf or petal migrate to the new leaf. In essence a plant has achieved the trick of scaling elastically.

How does it do this ?

There is a biological explanation associated to levels of hormones in the stem which are triggered by light levels which could be considered to be as adaptive as the AMG solver is, driven by its solution. Stepping back a bit there is an observation often used in math classes. The number of spirals in many plants is observed to be adjacent numbers in the Fibonacci sequence, often 8, 13 and 21 but sometimes as high as 144 spirals. There is a delightful explanation of Pinecones, Pineapples, Protea and the Fibonacci sequence by Vi Hart, even if you think you have learnt everything, its fun to watch.

How is this relevant ?

I wonder if a Cassandra ring seeded with an initial space that allowed say 5 partitions, but as those partitions passed a threshold of say 30% (with an even distribution) another partition was added. That new partition would attract new keys without requiring migration of the existing keys ensuring that the original partitions never filled. If successful as new nodes were added in the same way as segments are added to a pineapple the Cassandra cluster could scale elastically, or more elastically than it appears to do currently. That really is just a thought, and I havent written a partitioner yet to see if it would work. I think the partitioner would be based on the the ratio of adjacent numbers in the Fibonacci sequence. ie, the Golden Angle

Node.js vs SilkJS

28 09 2012

synchronous ducks

Node.js, everyone on the planet has heard about. Every developer at least. SilkJS is relatively new and creates an interesting server to compare Node.js against because it shares so much of the same code base. Both are based on the Google V8 Javascript engine that convert JS into compiled code before executing. Node.js as we all know uses a single thread that uses a OS level event queue to process events. What is often overlooked is that Node.js uses a single thread, and therefore a single core of the host machine. SilkJS is a threaded server using pthreads where each thread processes the request leaving it upto the OS to manage interleaving between threads while waiting for IO to complete. Node.js is often refereed to as Async and SilkJS is Sync. The advantages to both approaches that are the source of many flame wars. There is a good summary of the differences and reasons for each approach on the SilkJS website. In essence SilkJS claims to have a less complex programming model that does not require the developer to constantly think of everything in terms of events and callbacks in order to coerce a single thread into doing useful work whilst IO is happening. Although this approach hands the interleaving of IO over to the OS letting it decide when each pthread should be run. OS developers will argue that thats what an OS should be doing and certainly to get the most out of modern multicore hardware there is almost no way of getting away from the need to run multiple processes or threads to use all cores. There is some evidence in the benchmarks (horror, benchmarks, that’s a red rag to a bull!) from Node.js, SilkJS, Tomcat7, Jetty8, Tornado etc that using multiple threads or processes is a requirement for making use of all cores. So what is that evidence ?

Well, first read why not to trust benchmarks once you’ve read that lets assume that everyone creating a benchmark is trying to show their software off best.

The Node.js 0.8.0 gives a request/second benchmark for a 1K response at 3585.62 request/second.

Over at Vert.x there was an of Vert.x and Node.js showing Vert.x running at 300,00 requests/s. You do have to take it with a pinch of salt after you have read another post with some detailed analysis that points out testing performance on the same box with no network and no latency is theoretically interesting, but probably not informative for the real world. What is more important is can the server stand up reliably forever with no downtime and perform normal server side processing.

So the SilkJS benchmarks in one of its more reasonable benchmarks claim it runs at around 22,000 request per second delivering 13K of file from disk with a very high levels of concurrency 20000. Again its hard to tell how true the benchmark is since many of those requests are pipelined (no socket open overhead), but one thing is clear. With a server capable of handling that level of concurrency some of the passionate arguments supporting async servers running one thread per core are lost. Either way works.

There is a second side to the SilkJS claims that bears some weight. With 200 server threads, what happens when one dies or needs to do something that is not IO bound? Something mildly non trivial that might use a tiny bit of CPU. With 1 server thread we know what happens, the server queues everything up while the on server thread does that computation. With 200, the OS manages the time spent working on the 1 thread. There is a simple answer, offload anything that does and processing to a threaded environment, but then you might as well use an async proxy front end to achieve the same.

There is a second part of the SilkJS argument that holds some weight. What happens when 1 of the SilkJS workers dies? Errors that kill processes happen for all sorts of reasons, some of them nothing to do with the code in the thread. With 199 threads the server continues to respond, with 0 it does not. At this point everyone who is enjoying the single-threaded simplicity of an async server will, I am sure, be telling me their process is so robust it will never die. That may well be true, but process sometimes dont always die, sometimes they get killed. The counter argument is, what happens when all 199 threads are busy running something. The threaded server dies.

To be balanced, life in an async server can be wonderfully simple. There is absolutely no risk of thread contention since there is only ever one thread, and it doesn’t matter how long a request might be pending for IO for as all IO is theoretically non blocking. It doesn’t mater how many requests there are provided there is enough memory to represent the queue. Synchronous servers can’t do long requests required by WebSockets and CometD. Well they can, but the thread pool soon gets exhausted. The ugly truth is that async servers also have something that gets exhausted  Memory. Every operation in the event queue consumes valuable memory, and with many garbage collected system, garbage collection is significant. Although it may not be apparent at light loads, at heavy loads even if CPU and IO are not saturated, async servers suffer from memory exhaustion and or garbage collection trying to avoid memory exhaustion, which, may appear as CPU exhaustion. So life is not so simple, thread contention is replaced by memory contention which is arguably harder to address.

So what is the best server architecture for modern web application?

An architecture that uses threads for requests that can be processed and delivered in ms, consuming no memory and delegating responsibility for interleaving IO to the OS, the resident expert at that task. Coupled with an architecture that recognises long IO intensive requests as such and delegates them to async part of the server, and above all, an architecture on which a simple and straightforward framework can be built to allow developers to get on with the task of delivering applications at webscale, rather than wondering how to achieve webscale with high load reliability. I don’t have an answer, other than it could be built with Jetty, but I know one thing, the golden bullets on each side of this particular flame war are only part of the solution.

Google CourseBuilder, a scalable course delivery platform ?

15 09 2012

This week I discovered Google CourseBuilder, the latest entry into the MOOC arena. It’s a Google App Engine application that Google Research used to host a MOOC to 155K students a few months ago. It follows a simular pedagogy to that used by other MOOC providers with high quality video lessons, that give the student the feeling they are working one on one with the lecturer. Google have open sourced the code under and Apache 2 license which gives us all an insight into the economies of scale that a MOOC represents. Unlike the traditional Virtual Learning Environment where the needs of staff are catered for in the user interface, Google CourseBuilder currently delegates all the functionality to spreadsheets, editing snippets of javascript and html. There is no reason why it could not be given an user interface, but when you consider what its is trying to do you realise that staff user interfaces for course creation are less important than the delivery of the course at scale. Consequently the application itself is tightly focused on delivering the course as quickly and as simply as possible to as many users as possible. Google App Engine makes this easy, even for meer mortals. Once you have accepted that nothing is really for free, and you do have to pay for bandwidth used and energy in at some point scaling this application upto 100K or even 1M users requires little or no effort on  your part. You also, at the moment, have to accept if you are going to reach that many students, you are going to have to ask for a little bit of help from someone to write some HTML, drive a spreadsheet and write a bit of Javascript as well as hit the “deploy” button on the App Engine SDK. I say, at the moment, because it isn’t going to be that hard to create an administrative UI, and thats what I have been doing for a few hours this week.

So the reality is, very few lecturers are going to create a course that will be delivered to 155K students, and if they succeed in going viral, the drop out rate is likely to be very high. The course Google ran issued 22K certificates, indicating a drop out rate of 85%. Its still an impressive number when many campuses are no where near that size however, most institutions would not survive with that level of drop out and all would be looking at ways of reducing it. Institutions invest more in their students and so need lower levels of drop out. As a result, their courses are smaller, they don’t have the economies of scale and can’t invest as much in the delivery of each individual course. All is not lost however, the opportunity that Googles CourseBuilder represents could be utilized if there was a small reduction in effort associated with course creation and course delivery.

The video attached to this blog post shows how that might be achieved. This is a modified version of Google CourseBuilder that allows a single Google App Engine to host more than one course. It could easily host a course catalogue from an small institution or medium size faculty. That course catalogue is uploaded via a spreadsheet. Individual courses containing units and lessons are also uploaded via seperate spreadsheets.

Students sign in using their Google ID, Google Apps for Education ID, or OpenID. They then register with the the courses they want to take. If you want to give it a try there is a App Engine Instance running at, bear in mind its a free instance so may become unavailable.

At the moment the administrative interface is very basic, but the intention is to build that up to allow courses to be created without needing to resort to technical resources. So far I have spent about 4h eliminating most of the code base editing and adding multi course capability. The code base is available as a fork of the Google CourseBuilder project and can be deployed by anyone with a Google ID. Since the original code was written in Python, using a modern variant of the GAE framework porting to Django would be trivial  with those who have concern about running on Google infrastructure. Obviously in doing so, you will have to work out how to do the scaling, see Instagram for pointers on that.

Jackrabbit, Oak, Sling, maybe even OAE

30 08 2012

Back in January 2010 the Jackrabbit team starting asking its community what did it want to see in the next version of
Jackrabbit. There were some themes in the responses. High(er) levels of write concurrency, millions of child nodes and cloud scale clustering with NoSQL backends. I voted for all of these.

I was reminded of this activity by the announcement of a Oak Hackathon, or rather an Oakathon that is being organised this September at the .adaptTo conference in Berlin. This Oakathon seems to be intended to get users upto speed on using Oak, which means that it might be ready for users to take a look. So I did.  The code checks out, builds and passes all its integration tests. No surprises there from the Jackrabbit team.

I am not going to pretend I understand the storage model being used or how it addresses the requirements that came out of Jackrabbit, but the Persistence implementation looks like it could be adapted to a sharded schema over many DB instances or even ontop of Cassandra. The storage model looks something like a git tree. It seems to solve the many child nodes issue that Sparse Map Content solved for OAE in a slightly different, but more efficient way by using a DAG structure with pointers to a child tree rather than a parent pointer. I won’t be able to tell if the concurrency issues that caused me to have to squeeze Sparse Map Content into the Sling repository API layers, without some testing, but the approach looks like it might. Certainly the aims of the Road map cover the needs of OAE and go beyond the scale and concurrency required.

Best of all, it already has a Sling Repository implemented, so it should be relatively easy to spin up Sling on Oak and run all the tests that caused OAE to move from Sling on Jackrabbit to Sling on a hybrid of SMC and Jackrabbit.

Massively Online

22 08 2012

Paris Metro logo Español: Logo del Metro de Pa...

In mid 2008 during the Sakai conference at Paris amongst the early summer european heat and the over friendly crowding of the Paris Metro I was involved in a small group of friends who had the idea that Sakai could recapture its innovative lead in Higher Education. There were representatives of commercial organisations at the conference looking nervously at our plans and taking potential customers to one side. Over the months that followed our plans grew, fueled by enquiries of what scale. The application we started building then was built so that every operation was by pointer, scaling in every operation such that the power of N was considerably less than 1. This was not done purely to satisfy the desire to serve the enquiries for collaboration communities of upto 12M users. It was done to enable one of the very early aims of the project.

There is a iTunesU lecture from Stanford that I remember watching on an iPod travelling through the suburbs of Chicago in a taxi on the way to the airport. It told the story of medical device company (a case study) where a new CEO wanted to drive the bottom line throughout the entire organisation. Monthly reports were created based on current months data, unheard of in the company, and where headings could not be filled the words “Insufficient Data” were placed. This so incensed the readers of these reports that soon every line was filled. Had blanks been left, perhaps human curiosity would not have taken over. Most of us dislike the unknown and attempt to fill it with information.

Earlier that year Google had announced OpenSocial. A standard intended to make it easier for developers to build applications that integrated with a social network, perhaps an answer to Facebook’s App environment. The underlying motivations for this initiative was not really to pander to developers but rather to increase the quality, volume and hence value of the data that Google was able to collect on its product, you and me, or rather what we find relevant.

In 2008, I wanted to do two things. Make Sakai easier for developers to develop for, and make it a suitable platform for running at massive scale, not for the sake of scale but for the sake of collecting data, big data, from which to make informed decisions about the actions we take in Higher Education. Those actions don’t inform or drive purchase decisions, they inform and drive whole lifetimes of achievement. Saving a drop out student, creating a cancer researcher changes lives.

That was 2008. In 2009 presented a keynote speech at North Sidney Institute for the AuSakai conference. Although the speech was poorly delivered, the message had developed having attended and listened to Hadoop/Pig/Mahout sessions at Apache conferences. Higher Ed needed to learn from the analytics and data analysis being employed in the wider web. Our data was far richer in metadata than a marketing web site and failing to put the infrastructure in place to collect that rich metadata without reduction would put Higher Ed at a disadvantage.

Fast forward to today. The organisations that approached me with requirements for scale have moved on. One group, MIT and Harvard has formed edX, delivering courses at massive scale, collecting data and using the results to mark, learn and improve. Other groups went internal with development and new comers like Coursera have exploded on to the scene delivering the scale and insight that Sakai 3 was being built to deliver.

With the benefit of hindsight I can see the differentiator. Sakai 3, now Sakai OAE tried to solve the use case of content authoring for course delivery to small groups teaching. Analytics becomes irrelevant in that environment, and the application is incapable of making the compromises necessary to be adapted for scale. Like pushing a bolder up a hill. Contrast that to the workflow of Corsera and edX. The bolder is already at the top of the hill. Massive scale makes the business model work and so content authoring and creation can take on a wholly different nature. The problems to be solved are that of huge numbers of users interacting with a small number of courses, not of thousands of users and thousands of courses all interacting with each other in random ways.

The future. I suspect the quality and precision of teaching, backed by analytics and evidence that world class materials can deliver through the edX and Coursera platforms will make those platforms the new Google of education. Just as no small online advertising platform can compete in generalist online ad placement against Google, the platforms that don’t have the ability to collect data and use it appropriately will become irrelevant. I can’t tell who will win this the race to grab this new land, sadly I think what I started in 2008 never got out the blocks in this race.

Evolution of networks

2 08 2012

In late 1993 I remember using my first web browser. I was on a IBM RS6000, called “elysium”. It had a 24bit 3D graphics accelerator, obviously vital for the blue on grey text that turned purple when you clicked it. I also needed it to look at the results of analysis runs. The stress waves flowing through the body shell of the next BWM 3 series, calculated by a parallel implementation of an iterative dynamic solver I was working on for MSC-Nastran. I was at the Parallel Application Center in Southampton, UK. I worked on European projects doing large scale parallelisations. Generally in engineering but always solving practical problems; Processing seismic survey data from the south atlantic in 6 days rather than 6 weeks on HP convex clusters. Monte Carlo simulations of how brain tissue would absorb radiation during radio therapy, avoiding subjecting the cancer patient to two hospital visits. It was interesting work, and in my nieve youth I felt at times I was doing good for humanity. Using a web browser was becoming part of my normal life. I used it to visit the few academic sites that contained real information, to access papers and research that previously we would have received in paper form from the British Library. This was the first network, exciting, raw, generating a shift in how I communicated and how effective I was.

I remember some time that year, hearing about how this largely academic network was growing. It wasn’t going to be dominated by .edu  and, but there was going to be many many more .com s. I was worried by this. Concerned that the influx would bury a new found source of information. I was being selfish. One of my friends had foresight. He registered at a time when all web addresses were no more than TLAs. Some years later he had to hand it over to American Airlines. My own web address came a few years latter at a time when registration was a paper process accompanied by proof of registration at Companies House. This first network evolved with an influx of people and organisations performing a land grab. Soon all the TLAs were gone. Soon the number of porn sites far outweighed those with research content, and my worst fears were not realised as the search engines also evolved. Thanks to those earlier pioneers I can still find what I need.

Fast forward 2003, the doc com boom came and went. Sites that pushed content grew and fell as did the strange concept of money from nothing. Some survived. A new sort of network started. A network of humans using a site called MySpace. Initially it was cool and useful. It grew at a crazy pace with little or no control, allowing anyone to do almost anything on a web page.  And they did. Pretty soon it became one of the most unpleasant places to be on the internet. Then Murdoch bought it. No one talks about it any more. The myspace sign, if there is one doesnt appear on TV adverts beside that of Twitter or Facebook. If you look at it today it looks like a pale but still mildly unpleasant version of Facebook. I would guess a high portion of the Myspace pages are not those of normal human beings. The are a mixture of  bots and corporates all out to extract some last cent in vast quantities from some hair thin opportunity.  At least only 25M of us have been fooled by it.

Meanwhile, Facebook was growing. Facebook had the good sense not allow the type of  user generated content that had made MySpace so unpleasant. It started as a private club on some quite unsound social footings at Harvard.  Then it invited only a select few higher Ed institutions, first in the US, later overseas. Cambridge, where I now work was added relatively late on. It was a cool and fun place to be to start with. I remember being told by one particularly sharp Cambridge student who stared at me when I suggested she might like to use a reading list app, and told me to “stay out of my Facebook”. I was now older certainly not cool. As it opened up every wanted to have a Facebook page, and everyone did. 500M of us, oh no wait, that’s  now 900M of us, sorry I blinked. Or did we. The IPO had a valuation that predicted revenue streams in the gazillions, but the post IPO headache is kicking in strong for some. When Facebook was relatively small and raw, few were interested. Friends meant something. You could be very certain that everyone you could connect with was a normal human. With faith in human nature all have some good qualities. Those qualities made them worth connecting with. Facebook will eventually go the way of MySpace as in attempting to extract value from is product (us and our connections), it has attracted millions of fake us’es and devalued both the concept of friend and what it could sell those connections for. Everyone is born with a finite amont of  “Friend” currency. We all choose how to spend it. Some invest in 4 or 5 lifelong friends who stick together through thick and thin. Others spread it thinly over 100s. Perhaps the celebs who feel lonely have spread their Friend currency so thinly they cant identify and real friends any more.  No one has 500 valuable friends.

This is the evolution of networks. There is only a finite amount of value for any one node within the network. Initially in an unconnected state there will be lots of potential, but as it grows and each node makes more connections. Outside entities step in to extract value from those connections. As the connections are made, the value gets dissipated and lost until the space returns to primeval slime. Financial markets have a word for this. Arbitrage. Eventually every last opportunity is exploited until all is balanced and there are no opportunities left. Beyond that point we venture into an unreal world of fake promisses and pyramid selling.

I hope we can learn from how past networks have evolved. Twitter almost did during the Arab Spring. Sadly it too has been overrun by bots and organisations vying to exploit and extract. G+ might prove sustainable, but what else ? Coursera? Academia is a proven breeding ground.