Is ORM So bad?

12 06 2012

ORM gets a bad name, and most of the time it deserves it. It produces nasty, horrible queries that don’t scale and lead to dog-slow applications in production. Well, that’s not entirely fair. Programmers who write code or SQL and never bother to check whether they have made stupid mistakes lead to dog-slow applications in production. The problem with ORM is that it puts lots of power into a programmer’s hands, lets them loose and makes them think they can get away without thinking about what will happen when their application has more than 10 rows in a table. There is no magic to ORM: just like raw SQL, you have to tune it.

This is not rocket science. Load the tables you are making a query against with a representative number of records, run the query and then tune the query again, and again, and again. Then run the query concurrently, then run the query with updates concurrently. And tune, tune, tune. To do that, you have to read the manual on how to tune your chosen ORM in exactly the same way you will have read the tuning manual for your chosen RDBMS. There is hardly any difference between tuning raw SQL and tuning ORM generated queries.

Where ORMs differ, get a very bad name, and make many give up and code SQL directly, is in how hard or easy they make it for you both to tune the query and to make the query deliver precisely the results required. The bad ORMs (no naming and shaming here) make that painfully impossible. The good ORMs firstly generate queries that by and large can only be bettered by the best DBA, and if they occasionally fail, those ORMs are easy to tweak.

If you’re thinking it would be easier to use some other query mechanism, you will probably have to invest just as much time and effort tuning either code, update strategies or some other query language. There are really no magic bullets when it comes to making queries against large data sets go fast, unless your problem is trivial and pointer based.

So how do I tune ORM (or raw SQL for that matter) ?

There is little point in spending hours squeezing the last ms out of every query. It’s the worst queries that need attention first. One approach I use on DjOAE, an app using the Django ORM, is to set some limits, eg: GET requests must take less than 10ms and perform no more than 10 SQL operations; POST requests may take up to 50ms and 30 SQL operations. If any request breaches those limits, all the raw SQL with timing information is dumped (preferably painted red). With the database suitably constrained to make it work (ie made very small, with almost no cache so anything that isn’t a lookup runs like a dog), and loaded with representative data, any operation that needs tuning sticks out like a sore thumb. When I don’t see any more, I drop the thresholds, up the concurrency and raise the data set size.
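The budget check described above can be sketched in a few lines. This is a minimal illustration, not the DjOAE code: the names BUDGETS, check_budget and format_report are made up for the example, and the query records are assumed to be dicts with "sql" and "time" keys (time in seconds, as a string), which is the shape Django exposes through connection.queries when DEBUG is on.

```python
from dataclasses import dataclass, field

# Limits taken from the text: (max milliseconds, max SQL operations) per method.
BUDGETS = {"GET": (10.0, 10), "POST": (50.0, 30)}


@dataclass
class BudgetReport:
    method: str
    elapsed_ms: float
    query_count: int
    breached: bool
    dump: list = field(default_factory=list)  # (sql, milliseconds) pairs


def check_budget(method, elapsed_ms, queries):
    """Compare one request against its budget.

    `queries` is assumed to be a list of {"sql": ..., "time": "<seconds>"}
    dicts. Methods without an entry fall back to the POST limits, which is
    an arbitrary choice made for this sketch.
    """
    max_ms, max_queries = BUDGETS.get(method, BUDGETS["POST"])
    breached = elapsed_ms >= max_ms or len(queries) > max_queries
    report = BudgetReport(method, elapsed_ms, len(queries), breached)
    if breached:
        # Only dump the raw SQL and timings when a limit has been broken.
        report.dump = [(q["sql"], float(q["time"]) * 1000.0) for q in queries]
    return report


def format_report(report):
    """Render a report in red using ANSI escape codes."""
    lines = [f"{report.method} took {report.elapsed_ms:.1f}ms "
             f"with {report.query_count} SQL operations"]
    lines += [f"  {ms:.2f}ms  {sql}" for sql, ms in report.dump]
    return "\033[31m" + "\n".join(lines) + "\033[0m"
```

In a Django app this would most likely be called from a middleware that times the request and reads the query log at the end, but that wiring is left out here to keep the sketch free of framework dependencies.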

Once you have the detailed evidence, it’s as easy as falling off a log, and way more satisfying. You could probably do the same with any ORM, even Hibernate.

What about caching query sets in the app ?

Caching should really be turned off for tuning. An application that can’t work without caching at the query layer will generally not scale, as at some point the cache will suffer too much contention/invalidation/replication (gulp, no not replication, you must be kidding) and the underlying slow queries will be exposed. Where caching is worth it is where the query is already fast, but repeated billions of times. If you’re using caching as a query tuning strategy, that’s fine, but it will bite you in the end.





PyOAE renamed DjOAE

2 05 2012

I’ve been talking to several folks since my last post on PyOAE and it has become clear that the name doesn’t convey the right message. The questions often center around the production usage of a native Python webapp or the complexity of writing your own framework from scratch. To address this issue I have renamed PyOAE to DjOAE to reflect its true nature.

It is a Django web application, and the reason I chose Django was because I didn’t want to write yet another framework. I could have chosen any framework, even a Java framework if such a thing existed, but I chose Django because it has good production experience with some large sites, a vibrant community and has already solved most of the problems that a framework should have solved.

The latest addition to that set of already solved problems that I have needed is data and schema migration. DjOAE is intended to be deployed in a DevOps-like way, with hourly deployments if needed. To make that viable, the code base has to address schema and data migrations as they happen. I have started to use South, which not only provides a framework for doing this but also automates roll forward and roll back of database schema and data (if possible). For the deployer the command is ever so simple.

python manage.py migrate

This queries the database to work out where it is relative to the code, and then upgrades it to match the code.
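The idea behind that command can be shown with a toy version. This is a sketch of the general technique (record what has been applied, apply whatever is missing, in order) and not South’s implementation; the migration_history table, the MIGRATIONS list and the person table are all invented for the example, and SQLite is used only so that it runs anywhere.

```python
import sqlite3

# Hypothetical ordered list of migrations shipped with the code: (name, SQL).
MIGRATIONS = [
    ("0001_initial", "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)"),
    ("0002_add_email", "ALTER TABLE person ADD COLUMN email TEXT"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply, in order, every migration the database has not recorded yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS migration_history (name TEXT PRIMARY KEY)"
    )
    # Ask the database where it is relative to the code.
    applied = {row[0] for row in conn.execute("SELECT name FROM migration_history")}
    newly_applied = []
    for name, sql in migrations:
        if name in applied:
            continue
        conn.execute(sql)
        conn.execute("INSERT INTO migration_history (name) VALUES (?)", (name,))
        conn.commit()
        newly_applied.append(name)
    return newly_applied
```

Running migrate a second time against the same database applies nothing, which is what makes frequent deployments safe to repeat.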

This formalizes the process that has been used for years in Sakai CLE into a third party component used by thousands and avoids the nightmare scenario where all data migration has to be worked out when a release is performed.

I have to apologise to anyone upstream for the name change as it will cause some disruption, but better now than later. Fortunately clones are simple to adjust, as git seems to only care about the commit sha1 so a simple edit to .git/config changing

url = ssh://git@bitbucket.org/ieb/pyoae.git
to
url = ssh://git@bitbucket.org/ieb/djoae.git

should be enough.

If you are using the standard settings you will need to rename your database. I did this with pgAdminIII without dropping the database.





PyOAE

10 04 2012

Those of you who have been watching my G+ feed will have noticed some videos being posted. Those videos are of the OAE 1.2 UI running on a server developed in Python using Django. I am getting an increasing stream of questions about what it is, hence this blog post.

What is PyOAE?

PyOAE is a re-implementation of the OAE server using Django and a fully relational database schema. I use PostgreSQL and/or SQLite3, but it would probably work on any RDBMS supported by Django. The implementation uses the OAE 1.2.0 UI code (the yet to be released 1.2.0 branch) as its specification and runs that UI unmodified.

Is it a port of Nakamura to Python?

It is not a port of the Java code base called Nakamura that is the official server of OAE and shares no common code or concepts. It does not contain the SparseMap content system or a Python port of SparseMap.

When will it be released ?

It is not, yet, functionally complete and so is not ready for release. When it is, I will release it.

Will it scale ?

The short answer is I don’t have enough evidence to know at this stage. I have some circumstantial evidence and some hard evidence that suggests that there is no reason why a fully RDBMS model using Django might not scale.

Circumstantial: Django follows the same architectural model used by many LAMP based applications. Those have been shown to be amenable to scaling: WordPress, Wikipedia etc. In the educational space MoodleRooms scaled Moodle to 1M concurrent users with the help of MySQL and others.

Circumstantial: Sakai CLE uses relational storage and scales to the levels required to support the institutions that want to use it.

Circumstantial: Django has been used for some large, high traffic websites.

Hard: I have loaded the database with 1M users and the response time shows no increase with respect to the number of users.

Hard: I have loaded the database with 50K users and 300K messages with no sign of increasing response time.

Hard: I have loaded the content store with 100K users and 200K content items and seen no increase in response time.

However, I have not load tested with 1000s of concurrent users.

Will it support multi tenancy ?

Short answer is yes, in the same way that WordPress supports multi tenancy, but with the potential to support it in other ways.

Is it complex to deploy ?

No, it’s simple and uses any of the deployment mechanisms recommended by Django.

Does it use Solr?

At the moment it does not, although I haven’t implemented free text searching on the bodies of uploaded content. All queries are relational SQL generated by Django’s ORM framework.

What areas does it cover?

Users, Groups, Content, Messaging, Activities, Connections are currently implemented and tested. Content Authoring is partially supported and I have yet to look at supporting World. It’s about 7K lines of Python including comments, covering about 160 REST endpoints, with all UI content being served directly from disk. There are about 40 tables in the database.

What areas have you made different ?

Other than the obvious use of an RDBMS for storage there are 2 major differences that jump out. The first is that the content system is single instance, ie if you upload the same content item 100 times, it’s only stored once and referenced 100 times. The second is that the usernames are not visible in any http response from the server, making it impossible to harvest usernames from an OAE instance. This version uses FERPA safe opaque IDs. Due to some hard coded parts of the UI I have not been able to do the same for Group names, which can be harvested.
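A single instance store is easy to picture in code. The sketch below assumes duplicates are detected by hashing the uploaded bytes, which is one common way of doing it; the text does not say how DjOAE decides that two uploads are the same item, and the class and method names are invented for the illustration.

```python
import hashlib


class SingleInstanceStore:
    """Keep each distinct body once, keyed by its digest, and count references."""

    def __init__(self):
        self._bodies = {}      # digest -> bytes
        self._references = {}  # digest -> number of content items pointing at it

    def put(self, body: bytes) -> str:
        """Store the body if it is new and return the digest to reference it by."""
        digest = hashlib.sha256(body).hexdigest()
        if digest not in self._bodies:
            self._bodies[digest] = body
        self._references[digest] = self._references.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        return self._bodies[digest]

    def references(self, digest: str) -> int:
        return self._references.get(digest, 0)

    def stored_bodies(self) -> int:
        return len(self._bodies)
```

Uploading the same bytes 100 times leaves one stored body with a reference count of 100; a relational version would presumably keep the digest in one table and the references in another.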

Could a NoSQL store be used ?

If Django has an ORM adapter for the NoSQL store, then yes, although I haven’t tried, and I am not convinced it’s necessary for the volume of records I am seeing.

Could the same be done in Java?

Probably, but the pace of development would be considerably less (Django development follows an edit-save-refresh pattern and requires far fewer lines of code than Java). Also, a mature framework that had solved all the problems Django has addressed would be needed to avoid the curse of all Java developers, the temptation to write their own framework.

Who is developing it?

I have been developing it in my spare time over the past 6 weeks. It represents about 10 days work so far.

Why did you start doing it?

In January 2012 I heard that several people whose opinion I respected had said Nakamura should be scrapped and re-written in favour of a fully relational model. I wanted to find out if they were correct. So far I have found nothing that says they are wrong. My only regret is that I didn’t hear them earlier and I wish I had tried this in October 2010.

Why Django ?

I didn’t want to write yet another framework, and I don’t have enough free time to do this in Java. Django has proved to have everything I need, and Python has proved to be quick to develop and plenty fast enough to run OAE.

Didn’t you try GAE ?

I did try writing a Google App Engine backend for OAE; however, I soon found two problems. The UI requirements for counting and querying relating to Groups were incompatible with BigStore. I also calculated a GAE hosted instance would rapidly breach the free GAE hosting limits due to the volume of UI requests. Those two factors combined to cause me to abandon a GAE based backend.





Modern WebApps

12 03 2012

Modern web apps, like it or not, are going to make use of things like WebSockets. Browser support is already present and UX designers will start requiring that UI implementations get data from the server in real time. Polling is not a viable solution for real deployment, since at a network level it will cause the endless transfer of useless data to and from the server: each request asking every time, “what happened ?” and the server dutifully responding “like I said last time, nothing”. Even with minimal response sizes, every request comes with headers that will eat network capacity.

Moving away from the polling model will be easy for UI developers working mostly in the client and creating tempting UIs for a handful of users. Those attractive UIs generate demand, and soon the handful of users become hundreds or thousands. In the past we were able to simply scale up the web server, turn keep alives off, distribute static content and tune the hell out of each critical request. As WebSockets become more widespread, that won’t be possible.

The problem here is that web servers have been built for the last 10 years on a thread per request model, and many supporting libraries share that assumption. In the polling world that’s fine, since the request gets bound to the thread, the response is generated as fast as possible, and then the thread is unbound. Provided the response time is low enough, the request throughput of the server will be maintained high enough to service all requests without exhausting the OS’s ability to manage threads/processes/open resources.
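The link between response time and thread count can be made concrete with Little’s law (average concurrency equals arrival rate multiplied by time spent in the system). The figures below are assumptions picked purely for illustration and are not measurements from any real deployment; threads_needed is a made-up helper name.

```python
import math


def threads_needed(requests_per_second, hold_time_ms):
    """Little's law: average requests in flight = arrival rate x time held."""
    return math.ceil(requests_per_second * hold_time_ms / 1000)


# Assumed polling workload: 2000 requests/s, each answered in 10ms.
polling_threads = threads_needed(2000, 10)

# Assumed long-held workload: 50 new connections/s, each held for 30 minutes.
held_threads = threads_needed(50, 30 * 60 * 1000)

print(polling_threads, held_threads)
```

With those assumed numbers the polling workload keeps about 20 threads busy, while the long-held connections need about 90,000 in a thread per request server, even though far fewer requests arrive each second.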

Serving a WebSocket request with the same model is a problem. The request is bound to a thread, but the response is not generated as it waits, mid request, pending some external event. Some time later, that event happens and the response is delivered back to the client. The traditional web server environment will have to expect to be able to support as many concurrent requests on your infrastructure as there are users who have a page pointing to your server on one of the many tabs they have open. If you have 100K users with a browser window open on a page where you have a WebSocket connection, then the hosting infrastructure will need to support 100K in-progress requests. If the webserver model is process per request, somehow you have to provide resources to support 100K OS level processes. If it’s thread per request, then 100K threads. Obviously the only way of supporting this level of idle but connected requests is to use an event processing model. But that creates problems.
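To show what an event processing model buys, here is a small sketch using Python’s asyncio. It opens no real sockets: each coroutine merely stands in for one idle connection waiting on an external event, and the function names are invented for the example. The point is only that thousands of parked waiters can share one thread.

```python
import asyncio
import threading


async def idle_connection(event, thread_ids):
    """Stand-in for one open connection: wait for an external event, then respond."""
    await event.wait()
    thread_ids.add(threading.get_ident())  # record which thread served us
    return "update"


async def serve_idle_connections(count):
    event = asyncio.Event()
    thread_ids = set()
    tasks = [
        asyncio.create_task(idle_connection(event, thread_ids))
        for _ in range(count)
    ]
    await asyncio.sleep(0)  # let every task start and park on the event
    event.set()             # the external event finally happens
    responses = await asyncio.gather(*tasks)
    return len(responses), len(thread_ids)


def run(count=10_000):
    """Return (responses delivered, number of distinct threads used)."""
    return asyncio.run(serve_idle_connections(count))
```

Calling run() delivers 10,000 responses from a single thread. A real server has the same shape, with the event replaced by socket readiness reported by epoll or a similar mechanism.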

For instance, anyone writing PHP code will know it will probably only run in process per worker mode, as many of the PHP extensions are not thread safe. Java servlets are similar. Changes in the Servlet 3 spec added constructs to release the processing thread back to the container, but many applications are still being developed on Servlet 2.4 and 2.5, and most frameworks are not capable of suspending requests. Python using mod_wsgi doesn’t have a well defined way of releasing the processing thread back to the server, although there is some code originating from Google that uses mod_python to manipulate the connection and release the thread back to Apache Httpd.

There are new frameworks (eg Node.js) that address this problem, and there is a considerable amount of religion surrounding their use. The believers are able to show unbelievable performance levels on benchmark test cases, and the doubters are able to point to unbelievably complex and unfathomable real application code. There are plenty of other approaches to the same problem that avoid spaghetti code, but the fundamental message is that to support WebSockets at the server side an event based processing model has to be used. That is the direct opposite to how web applications have been delivered to date and, regardless of the religion, that creates a problem for deployment.

Deployment of this type of application demands that a connection can be unbound from the thread servicing the request when it becomes a WebSocket connection. The nasty twist is that every box handling the request needs to be able to do that, including any WebTiers or load balancers, and any HTTP connection can be converted from the HTTP protocol into the WebSocket protocol during the request. Fortunately, sensible applications will only support WebSocket on known URLs, which gives the LB and WebTiers an opportunity to route, but prior to routing every component in the chain must be using a small number of threads servicing a large number of open and active sockets.
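Routing on known URLs could look something like the sketch below. The path set and the pool names are hypothetical, and the header check follows the opening handshake in RFC 6455 (an Upgrade header naming websocket, plus a Connection header that lists Upgrade); a real load balancer would express this in its own configuration language rather than Python.

```python
# Hypothetical set of paths on which the application accepts WebSocket upgrades.
WEBSOCKET_PATHS = {"/ws/notifications"}


def is_upgrade_request(headers):
    """True if the request headers ask to switch the connection to WebSocket."""
    normalized = {key.lower(): value for key, value in headers.items()}
    connection_tokens = {
        token.strip().lower()
        for token in normalized.get("connection", "").split(",")
    }
    wants_websocket = normalized.get("upgrade", "").strip().lower() == "websocket"
    return "upgrade" in connection_tokens and wants_websocket


def route(path, headers):
    """Pick a backend pool by URL; the pool names are illustrative only."""
    if path in WEBSOCKET_PATHS:
        return "event-pool"   # event based servers that can hold idle sockets
    if is_upgrade_request(headers):
        return "reject"       # upgrade attempted on a URL that is not known
    return "thread-pool"      # traditional thread per request servers
```

For example, route("/ws/notifications", {"Connection": "Upgrade", "Upgrade": "websocket"}) selects the event based pool, while the same headers on any other path are refused.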

This doesn’t mean that an entire application framework must be thrown away, but it does mean that whatever is handling the WebSocket request, upgrade and eventual connection must be event based. This also doesn’t mean that everyone must learn how to read and write spaghetti code, managing every aspect of threading and concurrency in communication and re-writing every library to be non-blocking and asynchronous. Fortunately there are some extremely capable epoll based containers (including Node.js, other than its insistence on using JS) that can be used either as WebTier proxies or ultimate endpoints. Some of them, such as the Python based Tornado server, will host frameworks supporting the WSGI standard and are hence capable of running Django based applications for the non WebSocket portion. As can be seen from real benchmarks, these servers offer the performance levels expected of event based processing and support for traditional frameworks with real blocking resource connections.