Languages and Threading models

17 05 2012

Since I emerged from the dark world of Java, where anything is possible, I have been missing the freedom to do whatever I wanted with threads to exploit as many cores as are available. With a certain level of nervousness I have been reading commentary on most of the major languages, their threading models, and how they make it easy or hard to utilize or waste hardware resources. Every article I read sits on a scale somewhere between absolute truth and utter FUD. The articles towards the FUD end of the scale always seem to be benchmarks created by the author of the winning platform, so they are easy to spot. This post is not about which language is better or which app server is the coolest thing; it's a note to myself on what I have learnt, with the hope that if I have read too much FUD, someone will save me.

To the chase: I have looked at Java and Python, touched on Ruby, and thought about serving pages in event based and thread based modes. I am only considering web applications serving large numbers of users, and not thinking about compute intensive, massively parallel or GUI apps. Unless you are lucky enough to be able to fit all your data into memory, or even shard the memory over a wide scale cluster, the web application will become IO bound. Even if you have managed to fit all data into core memory you will still be IO bound on output, as core memory and CPU bandwidth will forever exceed that of networks, and 99% of webapps are not CPU intensive. If it was not that way, the MPP code I was working on in 1992 would have been truly massively parallel, and would have found a cure for cancer the following year. How well a language performs as the foundation of a web application is down to how well that language manages the latencies introduced by non core IO, and not how efficiently it optimises inner loops. I am warming to the opinion that all languages and most web application frameworks are created equal in this respect, and it's only in the presentation of what they do that there is differentiation. An example: a Python based server running in a threaded mode compared to Node.js.

Some background. Node.js uses the Chrome Javascript engine (V8), which compiles JS code to native machine code and optimises the patterns it sees at runtime. It runs application code as a single thread inside a process on one core, delivering events to code that performs work exclusively until it encounters some code that releases control back to the core event dispatch, normally by returning from the event handling code. The core of Node.js generally uses an efficient event dispatch mechanism built into the OS (epoll, kqueue etc.). There is no threading of application code within a Node.js process, and to use multicore hardware you must fork separate OS level processes which communicate over lightweight channels. Node.js gets its speed from ensuring that the single thread is never blocked by IO from doing work. The moment that would happen, the single thread in Node.js moves on to performing some other useful work. Being a single thread it never has to think about inter-thread locking. That is my understanding of Node.js.
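The event dispatch model described above can be sketched in Python, using the standard library asyncio loop as a stand-in for the Node.js loop. This is only an illustration under that assumption: the names handle_request, serve and run_demo are made up for the example, and asyncio.sleep stands in for real non-blocking IO.

```python
import asyncio
import threading
import time


async def handle_request(name, io_delay, log):
    # The handler runs exclusively until it awaits; the await hands
    # control back to the event loop, which picks another ready handler.
    log.append((name, "start", threading.get_ident()))
    await asyncio.sleep(io_delay)  # stand-in for non-blocking network or disk IO
    log.append((name, "done", threading.get_ident()))


async def serve(delays):
    log = []
    started = time.monotonic()
    await asyncio.gather(
        *(handle_request(f"req{i}", d, log) for i, d in enumerate(delays))
    )
    return log, time.monotonic() - started


def run_demo(delays=(0.2, 0.2, 0.2)):
    # Returns the event log and the wall clock time for all requests.
    return asyncio.run(serve(delays))


if __name__ == "__main__":
    log, elapsed = run_demo()
    for entry in log:
        print(entry)
    print(f"elapsed: {elapsed:.2f}s")
```

All three handlers record the same thread id, and the total time is close to the longest single wait rather than the sum of the waits, because the one thread is never parked on IO.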

Python (and Ruby to some extent), when running as a single process, allows the user to create threads. By default these are OS level threads (pthreads), although there are other models available. I am talking only about pthreads here, which don't require programmer intervention. Due to the nature of the Python interpreter there is a global interpreter lock (GIL) that only allows 1 Python thread to use the interpreter at a time. Threads are allowed to use the interpreter for a set time, after which they are rescheduled. Even if you run a Python process on a multicore system, my understanding is that only 1 thread per process will execute at a time. When a thread enters blocking IO it releases the lock, allowing other threads to execute. Like Node.js, to make full use of multicore hardware you must run more than one Python process. Unlike Node.js, the internal implementation of the interpreter, and not the programming style, ensures that the CPU running the Python process switches between threads so that it's always performing useful work. In fact that's not quite true, since the IO libraries in Node.js have to relinquish control back to the main event loop to ensure they do not block.
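A small sketch of the point about blocking IO releasing the GIL, assuming CPython and using time.sleep as a stand-in for a blocking read or database call. The function names blocking_io and run_threads are invented for the illustration.

```python
import threading
import time


def blocking_io(delay):
    # In CPython, time.sleep releases the GIL while the thread is blocked,
    # much as a blocking socket read or file read would.
    time.sleep(delay)


def run_threads(count, delay):
    # Start `count` OS level threads that each block for `delay` seconds
    # and return the wall clock time until all of them have finished.
    threads = [
        threading.Thread(target=blocking_io, args=(delay,)) for _ in range(count)
    ]
    started = time.monotonic()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.monotonic() - started


if __name__ == "__main__":
    print(f"8 threads blocking 0.2s each took {run_threads(8, 0.2):.2f}s")
```

If the lock were held across the wait, eight threads would take about 1.6 seconds. Because it is released, the waits overlap and the total stays close to a single wait, without the programmer writing anything to make that happen.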

So, provided the mechanism for delivering work to the process is event based, there is little difference in the potential for Ruby, Python or Node.js to utilize hardware resources effectively. They all need 1 process per hardware core. Where they differ is in how the programmer ensures that control is released on blocking. With Python (and Ruby IIUC), control is released by the core interpreter without the programmer even knowing it is happening. With Node.js control is released by the programmer invoking a function that explicitly passes control back. The only thing a Python programmer has to ensure is that there are sufficient threads in the process for the GIL to pass control to when IO latencies are encountered, and that depends on the deployment mechanism, which should be multi-threaded. The only added complication for the Node.js model is that the IO drivers need to ensure that every subsystem that performs blocking IO has some mechanism of storing state not bound to a thread (since there is only 1). A database transaction for one request must not interact with that for another. This is no mean feat, and I will guess (not having looked) it is similar to the context switching process between native OS level threads. The only thing you can't do in Node.js is perform a compute intensive task without releasing control back to the event loop. Doing that stops a Node.js process from serving any other requests. If you do that in Python, the interpreter suspends the pthread and reschedules after a set number of instructions. Proof, in some senses, that multitasking is a foundation of the language rather than an artifact of the programmer's code base.
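The effect of a compute intensive task in an event loop can be shown with a rough sketch. It again assumes Python's asyncio as a stand-in for the Node.js loop; heartbeat, greedy_handler and worst_gap are hypothetical names, and the timings are arbitrary.

```python
import asyncio
import time


def busy(seconds):
    # Pure computation: spins without ever returning control to the loop.
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass


async def heartbeat(gaps, beats, interval):
    # A well behaved handler that wants to run every `interval` seconds
    # and records how long it actually waited between runs.
    last = time.monotonic()
    for _ in range(beats):
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def greedy_handler(seconds):
    await asyncio.sleep(0.05)  # let the heartbeat get going first
    busy(seconds)              # no await inside, so nothing else can run


async def worst_gap(busy_seconds=0.3):
    gaps = []
    await asyncio.gather(
        heartbeat(gaps, beats=10, interval=0.02),
        greedy_handler(busy_seconds),
    )
    return max(gaps)


if __name__ == "__main__":
    print(f"worst heartbeat gap: {asyncio.run(worst_gap()):.2f}s")
```

The heartbeat asks to run every 20 ms, but its worst gap is at least as long as the busy loop, because nothing else is served until the greedy handler returns.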

The third language I mentioned is Java. Having spent most of the last 16 years coding Java based apps, I have enjoyed the freedom to be able to use every hardware core available from a single process, all sharing the same heap. I have also suffered the misery of having to deal with interleaving IO, synchronization and avoiding blocking over shared resources. Java is unlike the other languages in this respect, since it gives the programmer the tools and the responsibility to make best use of the hardware platform. Often that tempts the programmer to think they can be successful in eliminating all blocking IO by eliminating all non core memory IO. The reality is somewhat different, as no application that scales and connects humans together will ever have few enough connections between data to localise all the data used in a request to a single board of RAM. From my MPP years this was the domain decomposition bandwidth. It may be possible to eliminate IO from disk, but I have to doubt that a non trivial application can eliminate all backend network IO. In a sense, the threading model of Java tempts the developer to try and implement efficient hardware resource utilization, but doesn’t help them in doing so. The same can be said for many of the lower level compiled languages. Fast and dangerous.

Don’t forget, with web applications, it’s IO that matters.

 





Modern WebApps

12 03 2012

Modern web apps, like it or not, are going to make use of things like WebSockets. Browser support is already present and UX designers will start requiring that UI implementations get data from the server in real time. Polling is not a viable solution for real deployment, since at a network level it will cause the endless transfer of useless data to and from the server. Each request asks every time, “what happened?” and the server dutifully responds “like I said last time, nothing”. Even with minimal response sizes, every request comes with headers that will eat network capacity. Moving away from the polling model will be easy for UI developers working mostly in the client and creating tempting UIs for a handful of users. Those attractive UIs generate demand and soon the handful of users become hundreds or thousands. In the past we were able to simply scale up the web server, turn keep alives off, distribute static content and tune the hell out of each critical request. As WebSockets become more widespread, that won’t be possible. The problem here is that web servers have been built for the last 10 years on a thread per request model, and many supporting libraries share that assumption. In the polling world that’s fine, since the request gets bound to the thread, the response is generated as fast as possible, and the thread is unbound. Provided the response time is low enough, the request throughput of the server will be maintained high enough to service all requests without exhausting the OS’s ability to manage threads/processes/open resources.
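To put a rough number on the polling overhead described above, here is a back of the envelope calculation. Every figure in it (10,000 users, a 5 second poll interval, 700 bytes of request headers and 300 bytes of response headers) is an assumption chosen for illustration, not a measurement.

```python
def polling_overhead_bytes_per_sec(users, poll_interval_s,
                                   request_header_bytes, response_header_bytes):
    # Header traffic generated by polling alone, ignoring response bodies
    # and TCP/TLS overhead: every user sends one request per interval and
    # gets one "nothing happened" response back.
    requests_per_sec = users / poll_interval_s
    return requests_per_sec * (request_header_bytes + response_header_bytes)


if __name__ == "__main__":
    # Illustrative assumptions only.
    overhead = polling_overhead_bytes_per_sec(
        users=10_000,
        poll_interval_s=5,
        request_header_bytes=700,
        response_header_bytes=300,
    )
    print(f"{overhead / 1_000_000:.1f} MB/s of headers to say 'nothing happened'")
```

Under those assumptions the server handles 2,000 requests per second and moves about 2 MB of headers every second, all to report that nothing has changed.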

Serving a WebSocket request with the same model is a problem. The request is bound to a thread, and the response is not generated as it waits, mid request, pending some external event. Some time later, that event happens and the response is delivered back to the client. The traditional web server environment will have to expect to support as many concurrent requests on your infrastructure as there are users who have a page pointing to your server on one of the many tabs they have open. If you have 100K users with a browser window open on a page where you have a WebSocket connection, then the hosting infrastructure will need to support 100K in progress requests. If the webserver model is process per request, somehow you have to provide resources to support 100K OS level processes. If it's thread per request, then 100K threads. Obviously the only way of supporting this level of idle but connected requests is to use an event processing model. But that creates problems.
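As a sketch of why an event processing model copes with idle but connected requests, the following holds many waiting "connections" open on a single thread using Python's asyncio. The connections are simulated by coroutines waiting on an asyncio.Event rather than real sockets, and the names idle_connection and hold_connections are made up for the example.

```python
import asyncio
import threading


async def idle_connection(event, threads_seen):
    # Mid request: nothing to send until some external event happens.
    # While waiting, this costs a small object on the heap, not a thread.
    await event.wait()
    threads_seen.add(threading.get_ident())
    return "event delivered"


async def hold_connections(count):
    event = asyncio.Event()
    threads_seen = set()
    tasks = [
        asyncio.create_task(idle_connection(event, threads_seen))
        for _ in range(count)
    ]
    await asyncio.sleep(0)  # give every connection a chance to start waiting
    event.set()             # the external event finally happens
    replies = await asyncio.gather(*tasks)
    return len(replies), len(threads_seen)


if __name__ == "__main__":
    served, threads_used = asyncio.run(hold_connections(10_000))
    print(f"served {served} waiting connections using {threads_used} thread")
```

The same count under a thread per request model would need 10,000 OS level threads, each with its own stack, just to sit and wait.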

For instance, anyone writing PHP code will know it will probably only run in process per worker mode, as many of the PHP extensions are not thread safe. Java servlets are similar. The Servlet 3 spec adds constructs to release the processing thread back to the container, but many applications are still being developed on Servlet 2.4 and 2.5, and most frameworks are not capable of suspending requests. Python using mod_wsgi doesn’t have a well defined way of releasing the processing thread back to the server, although there is some code originating from Google that uses mod_python to manipulate the connection and release the thread back to Apache Httpd.

There are new frameworks (eg Node.js) that address this problem, and there is a considerable amount of religion surrounding their use. The believers are able to show unbelievable performance levels on benchmark test cases and the doubters are able to point to unbelievably complex and unfathomable real application code. There are plenty of other approaches to the same problem that avoid spaghetti code, but the fundamental message is that to support WebSockets at the server side an event based processing model has to be used. That is the direct opposite of how web applications have been delivered to date and, regardless of the religion, that creates a problem for deployment.

Deployment of this type of application demands that a connection can be unbound from the thread servicing the request when it becomes a WebSocket connection. The nasty twist is that every box handling the request needs to be able to do that, including any WebTiers or load balancers, and any HTTP connection can be converted from the HTTP protocol into the WebSocket protocol during the request. Fortunately, sensible applications will only support WebSocket on known URLs, which gives the LB and WebTiers an opportunity to route, but prior to routing every component in the chain must be using a small number of threads servicing a large number of open and active sockets.
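A minimal sketch of the routing decision described above, assuming the upgrade handshake defined in RFC 6455. The set of WebSocket URLs and the backend names are hypothetical; only the Upgrade and Connection header check and the Sec-WebSocket-Accept calculation come from the RFC.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

# Hypothetical: the known URLs on which this application accepts WebSockets.
WEBSOCKET_PATHS = {"/ws/events"}


def is_websocket_upgrade(headers):
    # An HTTP request asks to become a WebSocket with
    # "Upgrade: websocket" and a Connection header containing "Upgrade".
    lowered = {name.lower(): value for name, value in headers.items()}
    connection_tokens = {
        token.strip().lower() for token in lowered.get("connection", "").split(",")
    }
    return (lowered.get("upgrade", "").lower() == "websocket"
            and "upgrade" in connection_tokens)


def choose_backend(path, headers):
    # Hypothetical backend names: send upgrades on known URLs to the event
    # based tier and everything else to the traditional tier.
    if path in WEBSOCKET_PATHS and is_websocket_upgrade(headers):
        return "event-based"
    return "thread-per-request"


def accept_key(sec_websocket_key):
    # Value the server returns in Sec-WebSocket-Accept to complete the upgrade.
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    request_headers = {"Upgrade": "websocket", "Connection": "keep-alive, Upgrade"}
    print(choose_backend("/ws/events", request_headers))
    print(choose_backend("/index.html", {"Connection": "keep-alive"}))
```

Because the decision only needs the path and two headers, a load balancer or WebTier can make it before any application code runs, which is what makes routing on known URLs practical.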

This doesn’t mean that an entire application framework must be thrown away, but it does mean that whatever is handling the WebSocket request, upgrade and eventual connection must be event based. This also doesn’t mean that everyone must learn how to read and write spaghetti code, managing every aspect of threading and concurrency in communication and re-writing every library to be non-blocking and asynchronous. Fortunately there are some extremely capable epoll based containers (including Node.js, other than its insistence on using JS) that can be used either as WebTier proxies or ultimate endpoints. Some of them, such as the Python based Tornado server, can host frameworks supporting the WSGI standard and are hence capable of running Django based applications for the non WebSocket portion. As can be seen from real benchmarks, these servers offer the performance levels expected of event based processing together with support for traditional frameworks with real blocking resource connections.