PyOAE

10 04 2012

Those of you watching my G+ feed will have noticed some videos being posted. Those videos are of the OAE 1.2 UI running on a server developed in Python using Django. I am getting an increasing stream of questions about what it is, hence this blog post.

What is PyOAE?

PyOAE is a re-implementation of the OAE server using Django and a fully relational database schema. I use PostgreSQL and/or SQLite3, but it would probably work on any RDBMS supported by Django. The implementation uses the OAE 1.2.0 UI code (the yet-to-be-released 1.2.0 branch) as its specification and runs that UI unmodified.
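
To give a flavour of what a fully relational schema means in Django terms, here is a minimal sketch; the model names and fields are my illustration, not PyOAE's actual schema.

    # Illustrative Django models; names and fields are invented, not
    # PyOAE's actual schema.
    from django.db import models


    class ContentItem(models.Model):
        # The UI sees an opaque public identifier rather than an internal
        # key (see "What areas have you made different?" below).
        public_id = models.CharField(max_length=64, unique=True)
        mimetype = models.CharField(max_length=128)
        created = models.DateTimeField(auto_now_add=True)


    class Message(models.Model):
        sender = models.ForeignKey('auth.User', on_delete=models.CASCADE,
                                   related_name='sent_messages')
        recipient = models.ForeignKey('auth.User', on_delete=models.CASCADE,
                                      related_name='received_messages')
        body = models.TextField()
        sent_at = models.DateTimeField(auto_now_add=True)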

Is it a port of Nakamura to Python?

It is not a port of Nakamura, the Java code base that is the official OAE server, and it shares no common code or concepts with it. It does not contain the SparseMap content system or a Python port of SparseMap.

When will it be released?

It is not yet functionally complete, so it is not ready for release. When it is, I will release it.

Will it scale?

The short answer is that I don't have enough evidence to know at this stage. I have some circumstantial evidence and some hard evidence suggesting there is no reason why a fully relational model built on Django should not scale.

Circumstantial: Django follows the same architectural model used by many LAMP-based applications, which have been shown to scale: WordPress, Wikipedia, etc. In the educational space, MoodleRooms scaled Moodle to 1M concurrent users with the help of MySQL and others.

Circumstantial: Sakai CLE uses relational storage and scales to the levels required to support the institutions that want to use it.

Circumstantial: Django has been used for some large, high-traffic websites.

Hard: I have loaded the database with 1M users, and the response time shows no increase with respect to the number of users.

Hard: I have loaded the database with 50K users and 300K messages with no sign of increasing response time.

Hard: I have loaded the content store with 100K users and 200K content items and seen no increase in response time.

However, I have not load tested with thousands of concurrent users.

Will it support multi-tenancy?

The short answer is yes, in the same way that WordPress supports multi-tenancy, but with the potential to support it in other ways; one such way is sketched below.
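
For instance, a Django deployment could route each request's ORM traffic to a per-tenant database. This is only a sketch of one possible approach; all names are invented, and it is not PyOAE's actual mechanism.

    # Sketch: pick a per-tenant database from the request's hostname.
    import threading

    _current = threading.local()


    class TenantMiddleware(object):
        def process_request(self, request):
            # e.g. "campus-a.example.org" -> database alias "campus-a"
            _current.tenant = request.get_host().split('.')[0]


    class TenantRouter(object):
        # Registered via Django's DATABASE_ROUTERS setting.
        def db_for_read(self, model, **hints):
            return getattr(_current, 'tenant', 'default')

        def db_for_write(self, model, **hints):
            return getattr(_current, 'tenant', 'default')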

Is it complex to deploy?

No, it's simple and uses any of the deployment mechanisms recommended by Django.
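
For example, the standard Django WSGI entry point is all that any WSGI container needs; only the settings module name ("pyoae.settings") is my invention here.

    # wsgi.py - the standard Django WSGI entry point.
    import os

    from django.core.wsgi import get_wsgi_application

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "pyoae.settings")
    application = get_wsgi_application()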

Does it use Solr?

At the moment it does not; all queries are relational SQL generated by Django's ORM framework. I haven't yet implemented free-text searching over the bodies of uploaded content.
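
As an illustration, using the hypothetical models sketched earlier, these are the kinds of ORM queries involved; Django compiles each into a single parameterised SQL statement.

    # Hypothetical queries against the sketch models above.
    def recent_messages(user, limit=20):
        return (Message.objects
                .filter(recipient=user)
                .order_by('-sent_at')[:limit])


    def content_of_type(mimetype):
        return ContentItem.objects.filter(mimetype=mimetype)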

What areas does it cover?

Users, Groups, Content, Messaging, Activities, and Connections are currently implemented and tested. Content Authoring is partially supported, and I have yet to look at supporting World. It is about 7K lines of Python, including comments, covering about 160 REST endpoints, with all UI content being served directly from disk. There are about 40 tables in the database.

What areas have you made different?

Other than the obvious use of an RDBMS for storage, there are two major differences that jump out. The first is that the content system is single-instance: if you upload the same content item 100 times, it is stored only once and referenced 100 times. The second is that usernames are not visible in any HTTP response from the server, making it impossible to harvest usernames from an OAE instance; this version uses FERPA-safe opaque IDs. Due to some hard-coded parts of the UI, I have not been able to do the same for Group names, which can still be harvested. A sketch of the single-instance idea follows.
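
A minimal sketch of single-instance (content-addressed) storage, assuming blobs are keyed by digest; this is not PyOAE's actual code.

    import hashlib
    import os


    def store_once(store_dir, data):
        # Content-addressed storage: the blob's digest is its address, so
        # the same bytes uploaded 100 times are written to disk exactly
        # once and referenced 100 times.
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(store_dir, digest[:2], digest)
        if not os.path.exists(path):
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, 'wb') as f:
                f.write(data)
        return digest  # stored against each content item row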

Could a NoSQL store be used?

If Django has an ORM adapter for the NoSQL store, then yes, although I haven't tried, and I am not convinced it is necessary for the volume of records I am seeing.

Could the same be done in Java?

Probably, but the pace of development would be considerably slower (Django development follows an edit-save-refresh pattern and requires far fewer lines of code than Java). Also, a mature framework that had solved all the problems Django has addressed would be needed to avoid the curse of all Java developers: the temptation to write their own framework.

Who is developing it?

I have been developing it in my spare time over the past six weeks. It represents about 10 days of work so far.

Why did you start doing it?

In January 2012 I heard that several people whose opinions I respected had said Nakamura should be scrapped and rewritten with a fully relational model. I wanted to find out if they were correct. So far I have found nothing that says they are wrong. My only regret is that I didn't hear them earlier; I wish I had tried this in October 2010.

Why Django?

I didn't want to write yet another framework, and I don't have enough free time to do this in Java. Django has proved to have everything I need, and Python has proved quick to develop in and plenty fast enough to run OAE.

Didn't you try GAE?

I did try writing a Google App Engine backend for OAE, but I soon found two problems. The UI requirements for counting and querying related to Groups were incompatible with BigTable, and I calculated that a GAE-hosted instance would rapidly breach the free GAE hosting limits due to the volume of UI requests. Those two factors combined caused me to abandon a GAE-based backend.

Deploying/Tuning SparseMap Correctly: Update

1 03 2012

In the last post I reported the impact of using SparseMap with caching disabled, but at the same time I noticed that there was an error in the way JDBC connections were handled. SparseMap doesn't pool JDBC connections. It assumes that anyone using SparseMap will read the Session API and note that Sessions should only be used on one thread during their active lifetime. Although the SparseMap Session implementation is thread safe, following that rule eliminates all blocking synchronization and concurrent locks. If the Session API is used as intended, JDBC connections can be taken out of the pool and bound to threads, which ensures that only one thread will ever access a JDBC connection at any one time. When bound to threads, JDBC connections can be long lived, and so they are only revalidated if they have been idle for some time. If any connection is found to be in an error state, it is replaced with a fresh connection.
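
SparseMap itself is Java, but the thread-binding pattern is easy to sketch; here it is in Python, with the idle threshold an assumed value rather than SparseMap's actual one.

    import threading
    import time

    IDLE_REVALIDATE_SECS = 60  # assumed threshold, not SparseMap's value


    class ThreadBoundConnections(object):
        def __init__(self, connect, validate):
            self._local = threading.local()  # one connection per thread
            self._connect = connect          # returns a fresh connection
            self._validate = validate        # True if still usable

        def get(self):
            conn = getattr(self._local, 'conn', None)
            idle = time.time() - getattr(self._local, 'last_used', 0.0)
            # Long-lived connections are only revalidated after sitting
            # idle; a broken one is replaced with a fresh connection.
            if conn is None or (idle > IDLE_REVALIDATE_SECS
                                and not self._validate(conn)):
                conn = self._connect()
                self._local.conn = conn
            self._local.last_used = time.time()
            return conn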

In Sakai OAE 1.2, with SparseMap caching inadvertently disabled and JDBC connections being validated on every request, there were about 500,000 SQL operations in the short load test. With those issues addressed, the number of SQL operations drops to 6,600, removing almost 1,000s (over 15 minutes) from the 1 hour load test execution time and removing JDBC entirely from the list of JVM hot spots. Notice that in the last hotspot report SparseMap shows almost no time blocked by sync operations, although there is more time spent suspended than I would like, which needs to be investigated. I cannot stress how important it is to make certain that caching is working properly if you are using SparseMap. Here are the results, for which I can't take any credit: the load testing was performed by Jon Cook at Indiana University, and full details can be found at https://uisapp2.iu.edu/confluence-prd/display/ONC/OAE+Evaluation+Load+Test+Results

SQL profile before caching was enabled correctly

SQL profile after caching was enabled correctly

JVM hotspots before caching was enabled correctly





Deploying/Tuning SparseMap Correctly

26 02 2012

Deploying

SparseMap is designed to maintain a shared cache of recently accessed maps in memory. The code base itself is also designed to use as little memory as possible; the SparseMap app server runs happily at load in 20MB of heap. Sakai OAE, which is the main user of SparseMap, uses a little more than that (around 200MB), leaving the remainder of the heap available for caching. If caching is working correctly, and there is sufficient heap available for the number of active users, the profile of calls to the storage layer should show almost no reads and a low level of writes. If, however, a mistake is made, the impact is dramatic. The first trace here shows Sakai OAE as of 2 Feb 2012 running SparseMap 1.3 with a misconfigured cache setup. The image shows the SQL report from NeoLoad.

You can see that there is a colossal number of SQL statements performing a query on parent hash, and there is also a massive number of other queries. Obviously something is not right.

Compare that with Sakai OAE on 23 Feb 2012 running SparseMap 1.5 with caching correctly configured.

The query profile has completely changed, with almost everything being served from cache in this test. The 282,189 queries taking 577s for parenthash have become 325 queries taking 0.645s. The message here is: don't deploy SparseMap without caching enabled, and check that it is enabled and sized correctly. There are periodic log statements coming from SparseMap that indicate the performance of the cache, which should always be running at over an 80% hit rate.

Tuning

SparseMap comes with a default configuration for SQL and DDL. It may be perfectly OK for most installations, never needing any tuning, but the design and implementation assume that deployers will tune both the DDL and the SQL.

Tuning DDL

The DDL that comes with the RDBMS drivers is a default SQL schema. It assumes that the deployment is going to be small to medium in size and will probably never see more than 1M content items. If, after sizing a production deployment, it is clear that the application will contain more than 1M items, then some tuning of this DDL must be done; how much depends on how big the installation will be. The internal structure of SparseMap was designed to use database shards in the same way that YouTube's metadata store does. The sharding is performed on the first two characters of the row key, giving a theoretical maximum of 64^2 = 4096 shards, although the configuration file would become unmanageable with that many shards. The idea is sketched below.
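
A sketch of key-prefix sharding in Python; the shard names and mapping are invented for illustration.

    # Route each row to a shard chosen by the first two characters of its
    # row key; the mapping here is invented for illustration.
    SHARD_MAP = {
        'aa': 'css_shard_0',
        'ab': 'css_shard_1',
        # ... one entry per configured prefix
    }


    def table_for(row_key, default='css'):
        return SHARD_MAP.get(row_key[:2], default)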

Even if sharding is not required, the indexing operations within SparseMap will need tuning. SparseMap should only be configured to index the column values that the application needs to index. By default there is only a single, very wide indexing table, which can become extremely inefficient: its columns are by default large varchars, and for many situations this will be very slow and wasteful. Once the client application knows what its indexing tables need to look like, it should create those tables before SparseMap starts up, so that each table (yes, there are supposed to be more than one index table) is dense and efficient, with properties that are queried and indexed by the same use case living in the same table. If the application has chosen to use non-real-time querying (e.g. by using Solr), then it should ensure that SparseMap is not unnecessarily indexing data that it will never use.

Tuning SQL

One of the main requirements for SparseMap was that it would allow UI devs to store and query any unstructured data without having to write a line of Java code. Consequently, the queries it generates are not the most efficient. The design always assumed that deployers would at some point need to tune the SQL the application was hitting the database with, and that they would want to do so without touching a line of Java code. All the SQL is in property files to allow deployers to tune it in production.

To assist a deployer in tuning, users of SparseMap should name all their queries by setting the property "_statementset" on each query to a distinct name. If that is done, the deployer can bind customised, tuned queries to each value of _statementset. The binding of queries also takes account of sharding if the storage layer has been sharded.
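
The effect can be sketched like this; the statement-set names and SQL are invented, and SparseMap's real property-file format differs.

    # Default, generated-style SQL keyed by statement-set name...
    DEFAULT_SQL = {
        'list-messages': 'SELECT rid FROM css WHERE cid = ? AND v = ?',
    }

    # ...which a deployer can override per statement set, and per shard
    # where the store is sharded, without touching any Java code.
    TUNED_SQL = {
        ('list-messages', 'css_shard_0'):
            'SELECT /*+ INDEX(css_shard_0 msg_idx) */ rid '
            'FROM css_shard_0 WHERE cid = ? AND v = ?',
    }


    def sql_for(statement_set, shard=None):
        return TUNED_SQL.get((statement_set, shard),
                             DEFAULT_SQL[statement_set])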

This is only an introduction to deploying and tuning SparseMap. I would be astounded if SparseMap were so elegant that it could be deployed into production without ensuring that caching was configured correctly, the DDL was adjusted, and the SQL was tuned where appropriate.

For the observant amongst you, the NeoLoad SQL reports have revealed a rather obvious bug that needs attention. SQL connections are bound to threads, not to the pooled storage client implementation. When the storage client implementation is borrowed from the pool, the SQL connection associated with the thread performing the borrow operation is validated; in this case, on Oracle, select 1 from dual is executed. Since the connection is thread bound this is unnecessary, and it accounts for 496s of the 509s consumed by the load test. I intend to remove most of that in the next release, as the validation approach is incorrect and does not protect against SQL failure mid-request. Correctly configured caching did, thankfully, reduce the SQL portion of this load test from 1064s down to 509s, although I think it should be possible to reduce that to around 120s, at which point the load test will need to be upgraded.

Update: 29 Feb 2012

The high levels of select 1 from dual have been eliminated. Duffy Gillman from rSmart did the bulk of the work a few months ago in SparseMap 1.5; however, he/we/I failed to remove all the places where connections were verified unnecessarily. The issue was fixed with commit 778a0bfe97963dccf46566a431853bab6f7c87cc, which is available in the master branch of SparseMap for Sakai OAE to merge into their fork of the codebase. Hopefully the improvement will show up in the next round of load testing, and it should also remove JDBC as the top hotspot.