Node.js vs SilkJS

28 09 2012

synchronous ducks

Node.js, everyone on the planet has heard about. Every developer at least. SilkJS is relatively new, and it makes an interesting server to compare Node.js against because it shares so much of the same code base. Both are based on the Google V8 Javascript engine, which compiles JS into native code before executing it. Node.js, as we all know, uses a single thread driven by an OS-level event queue to process events. What is often overlooked is that a single thread means a single core of the host machine. SilkJS is a threaded server using pthreads, where each thread processes a request, leaving it up to the OS to manage interleaving between threads while waiting for IO to complete. Node.js is often referred to as async and SilkJS as sync. There are advantages to both approaches, and they are the source of many flame wars. There is a good summary of the differences and the reasons for each approach on the SilkJS website. In essence, SilkJS claims to have a less complex programming model that does not require the developer to constantly think of everything in terms of events and callbacks in order to coerce a single thread into doing useful work while IO is happening. This approach does hand the interleaving of IO over to the OS, letting it decide when each pthread should run. OS developers will argue that that's exactly what an OS should be doing, and certainly, to get the most out of modern multicore hardware, there is almost no way of getting away from the need to run multiple processes or threads to use all cores. There is some evidence in the benchmarks (horror, benchmarks, that's a red rag to a bull!) from Node.js, SilkJS, Tomcat7, Jetty8, Tornado etc that using multiple threads or processes is a requirement for making use of all cores. So what is that evidence?
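
To make the contrast concrete, here is a minimal sketch of the two styles, written entirely in Node.js since no SilkJS code appears in this post: the async version hands a callback to the event loop and returns immediately, while the sync version blocks its thread and leaves the interleaving to the OS.

    // Both styles, illustrated with Node's own fs module (the SilkJS API is not shown here).
    var fs = require('fs');

    // Async style: register a callback and give the thread straight back to the event loop.
    fs.readFile('/etc/hosts', 'utf8', function (err, data) {
      if (err) throw err;
      console.log('async read finished: ' + data.length + ' bytes');
    });

    // Sync style: the calling thread blocks until the OS has delivered the data.
    var hosts = fs.readFileSync('/etc/hosts', 'utf8');
    console.log('sync read finished: ' + hosts.length + ' bytes');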

Well, first read why not to trust benchmarks: http://webtide.intalio.com/2010/06/lies-damned-lies-and-benchmarks-2/ Once you've read that, let's assume that everyone creating a benchmark is trying to show their software off at its best.

The Node.js 0.8.0 release gives a benchmark of 3585.62 requests/second for a 1K response. http://blog.nodejs.org/2012/06/25/node-v0-8-0/

Over at Vert.x there was a benchmark of Vert.x and Node.js showing Vert.x running at 300,000 requests/s. You do have to take it with a pinch of salt once you have read another post, http://webtide.intalio.com/2012/05/truth-in-benchmarking/, with some detailed analysis pointing out that testing performance on the same box, with no network and no latency, is theoretically interesting but probably not informative for the real world. What is more important is whether the server can stay up reliably, forever, with no downtime, while performing normal server-side processing.

One of the more reasonable SilkJS benchmarks claims it runs at around 22,000 requests per second delivering a 13K file from disk at a very high level of concurrency (20,000 connections). Again it's hard to tell how much the benchmark proves, since many of those requests are pipelined (no socket-open overhead), but one thing is clear: with a server capable of handling that level of concurrency, some of the passionate arguments supporting async servers running one thread per core are lost. Either way works.

There is one side to the SilkJS claims that bears some weight. With 200 server threads, what happens when one dies or needs to do something that is not IO bound? Something mildly non-trivial that might use a tiny bit of CPU. With one server thread we know what happens: the server queues everything up while that one thread does the computation. With 200, the OS manages the time spent working on the one busy thread. There is a simple answer: offload anything that does any real processing to a threaded environment, but then you might as well use an async proxy front end to achieve the same.
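
By way of illustration, a sketch of that offloading in Node.js terms: the event loop stays responsive by farming the computation out to a child process. The worker.js module is hypothetical, standing in for whatever the CPU-bound work actually is.

    // main.js - keep the single event-loop thread free by pushing CPU work into a child process.
    var http = require('http');
    var fork = require('child_process').fork;

    http.createServer(function (req, res) {
      var worker = fork(__dirname + '/worker.js');   // hypothetical CPU-bound worker
      worker.on('message', function (result) {
        res.end('result: ' + result + '\n');
        worker.kill();
      });
      worker.send({ job: req.url });                 // the event loop carries on serving while the child computes
    }).listen(8080);

    // worker.js - receives a job, burns some CPU, sends the answer back.
    process.on('message', function (msg) {
      var total = 0;
      for (var i = 0; i < 1e8; i++) { total += i; }  // stand-in for real work
      process.send(total);
    });

Forking a process per request is of course wasteful; a real server would keep a pool of workers, which is exactly the point at which the "you might as well put a threaded back end behind an async proxy" argument begins.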

There is a second part of the SilkJS argument that holds some weight. What happens when one of the SilkJS workers dies? Errors that kill processes happen for all sorts of reasons, some of them nothing to do with the code in the thread. With 199 threads remaining the server continues to respond; with 0 it does not. At this point everyone who is enjoying the single-threaded simplicity of an async server will, I am sure, be telling me their process is so robust it will never die. That may well be true, but processes don't always die of their own accord; sometimes they get killed. The counter argument is: what happens when all 199 remaining threads are busy running something? The threaded server stops serving too.
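
To be fair, the async camp has a standard answer to worker death as well: run one process per core and respawn on exit. A minimal sketch with Node's cluster module, so a killed worker costs its in-flight requests rather than the whole server.

    // A sketch of surviving worker death with Node's cluster module.
    var cluster = require('cluster');
    var http = require('http');
    var os = require('os');

    if (cluster.isMaster) {
      // One worker per core, and a replacement whenever one dies or is killed.
      os.cpus().forEach(function () { cluster.fork(); });
      cluster.on('exit', function (worker, code, signal) {
        console.log('worker ' + worker.process.pid + ' died (' + (signal || code) + '), restarting');
        cluster.fork();
      });
    } else {
      http.createServer(function (req, res) {
        res.end('hello from ' + process.pid + '\n');
      }).listen(8080);
    }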

To be balanced, life in an async server can be wonderfully simple. There is absolutely no risk of thread contention, since there is only ever one thread, and it doesn't matter how long a request might be pending on IO, as all IO is theoretically non-blocking. It doesn't matter how many requests there are, provided there is enough memory to represent the queue. Synchronous servers can't do the long requests required by WebSockets and CometD. Well, they can, but the thread pool soon gets exhausted. The ugly truth is that async servers also have something that gets exhausted: memory. Every operation in the event queue consumes valuable memory, and with many garbage-collected systems, garbage collection is significant. Although it may not be apparent at light loads, at heavy loads, even if CPU and IO are not saturated, async servers suffer from memory exhaustion and/or from the garbage collection trying to avoid memory exhaustion, which may appear as CPU exhaustion. So life is not so simple: thread contention is replaced by memory contention, which is arguably harder to address.
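
A crude sketch of where that memory goes: a long-poll style endpoint that parks every response until there is something to say. Nothing is blocked, but every parked response, and everything its closure references, stays on the heap for the life of the request.

    // Pending work in an async server is heap, not threads.
    var http = require('http');
    var waiting = [];   // every parked long-poll response lives here until an event arrives

    http.createServer(function (req, res) {
      waiting.push(res);               // no thread is held, but no memory is freed either
    }).listen(8080);

    setInterval(function () {
      var mb = Math.round(process.memoryUsage().rss / 1048576);
      console.log(waiting.length + ' responses parked, rss ' + mb + ' MB');
      while (waiting.length) { waiting.pop().end('tick\n'); }   // the "event" finally arrives
    }, 10000);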

So what is the best server architecture for a modern web application?

An architecture that uses threads for requests that can be processed and delivered in milliseconds, consuming no memory and delegating responsibility for interleaving IO to the OS, the resident expert at that task. Coupled with an architecture that recognises long, IO-intensive requests as such and delegates them to an async part of the server. And above all, an architecture on which a simple and straightforward framework can be built, to allow developers to get on with the task of delivering applications at webscale rather than wondering how to achieve webscale with high-load reliability. I don't have an answer, other than that it could be built with Jetty, but I know one thing: the silver bullets on each side of this particular flame war are only part of the solution.





Languages and Threading models

17 05 2012

Since I emerged from the dark world of Java, where anything is possible, I have been missing the freedom to do whatever I wanted with threads to exploit as many cores as are available. With a certain level of nervousness I have been reading commentary on most of the major languages surrounding their threading models and how they make it easy or hard to utilize or waste hardware resources. Every article I read sits on a scale somewhere between absolute truth and utter FUD. The articles towards the FUD end of the scale always seem to be based on benchmarks created by the author of the winning platform, so they are easy to spot. This post is not about which language is better or which app server is the coolest thing; it's a note to myself on what I have learnt, with the hope that if I have read too much FUD, someone will save me.

To the chase: I have looked at Java and Python, touched on Ruby, and thought about serving pages in event-based and thread-based modes. I am only considering web applications serving large numbers of users, not compute-intensive, massively parallel or GUI apps. Unless you are lucky enough to be able to fit all your data into memory, or even to shard the memory over a wide-scale cluster, the web application will become IO bound. Even if you have managed to fit all data into core memory, you will still be IO bound on output, as core memory and CPU bandwidth will forever exceed that of networks, and 99% of webapps are not CPU intensive. If it were not that way, the MPP code I was working on in 1992 would have been truly massively parallel, and would have found a cure for cancer the following year. How well a language performs as the foundation of a web application is down to how well that language manages the latencies introduced by non-core IO, not how efficiently it optimises inner loops. I am warming to the opinion that all languages and most web application frameworks are created equal in this respect, and it's only in the presentation of what they do that there is differentiation. An example: a Python-based server running in threaded mode compared to Node.js.

Some background. Node.js uses the Chrome V8 Javascript engine, which compiles JS into native machine code before executing it. It runs as a single thread inside a process on one core, delivering events to code that performs work exclusively until it releases control back to the core event dispatch, normally by returning from the event-handling code. The core of Node.js generally uses an efficient event dispatch mechanism built into the OS (epoll, kqueue etc). There is no internal threading within a Node.js process, and to use multicore hardware you must fork separate OS-level processes which communicate over lightweight channels. Node.js gets its speed from ensuring that the single thread is never blocked waiting on IO: the moment a request would block, the single thread moves on to performing some other useful work. Being single-threaded, it never has to think about inter-thread locking. That is my understanding of Node.js.
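
A tiny sketch of what "performing work exclusively until control is released" means in practice: the timer below cannot fire until the busy loop has finished, because there is only one thread to run both.

    // One thread: nothing else runs until the current handler returns to the event loop.
    var start = Date.now();

    setTimeout(function () {
      console.log('timer fired after ' + (Date.now() - start) + ' ms');  // ~2000ms, not 10ms
    }, 10);

    while (Date.now() - start < 2000) { /* hog the single thread for two seconds */ }
    console.log('busy loop done');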

Python (and Ruby to some extent), when running as a single process, allows the user to create threads. By default these are OS-level threads (pthreads), although there are other models available; I am talking only about pthreads here, which don't require programmer intervention. Due to the nature of the Python interpreter there is a global interpreter lock (GIL) that allows only one Python thread to use the interpreter at a time. Threads are allowed to use the interpreter for a set interval, after which they are rescheduled. Even if you run a Python process on a multicore system, my understanding is that only one thread per process will execute Python code at a time. When a thread enters blocking IO it releases the lock, allowing other threads to execute. Like Node.js, to make full use of multicore hardware you must run more than one Python process. Unlike Node.js, it is the internal implementation of the interpreter, and not the programming style, that ensures the CPU running the Python process switches between threads so it is always performing useful work. In fact that's not quite true, since in Node.js it is the IO libraries that have to relinquish control back to the main event loop to ensure they do not block.

So, provided the mechanism for delivering work to the process is event based, there is little difference in the potential for Ruby, Python or Node.js to utilize hardware resources effectively. They all need one process per hardware core. Where they differ is in how the programmer ensures that control is released on blocking. With Python (and Ruby, IIUC), control is released by the core interpreter without the programmer even knowing it is happening. With Node.js, control is released by the programmer invoking a function that explicitly passes control back. The only thing a Python programmer has to ensure is that there are sufficient threads in the process for the GIL to pass control to when IO latencies are encountered, and that depends on the deployment mechanism, which should be multi-threaded. The only added complication for the Node.js model is that the IO drivers need to ensure that every subsystem that performs blocking IO has some mechanism of storing state not bound to a thread (since there is only one). A database transaction for one request must not interact with that for another. This is no mean feat, and I will guess (not having looked) it is similar to the context-switching process between native OS-level threads. The only thing you can't do in Node.js is perform a compute-intensive task without releasing control back to the event loop. Doing that stops a Node.js process from serving any other requests. If you do that in Python, the interpreter suspends the pthread and reschedules it after a set number of instructions. Proof, in some sense, that multitasking is a foundation of the language rather than an artifact of the programmer's code base.
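
The usual workaround for that compute-intensive case in Node.js is to slice the work up and hand control back to the event loop between slices. A sketch, with made-up numbers and a made-up task, using setImmediate:

    // A sketch of yielding to the event loop part-way through a long computation.
    function sumSlowly(n, done) {
      var total = 0, i = 0;
      (function chunk() {
        var stop = Math.min(i + 100000, n);
        for (; i < stop; i++) { total += i; }   // do one slice of the work
        if (i < n) {
          setImmediate(chunk);                  // let pending IO callbacks run, then continue
        } else {
          done(total);
        }
      })();
    }

    sumSlowly(10000000, function (total) {
      console.log('total = ' + total);
    });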

The third language I mentioned is Java. Having spent most of the last 16 years coding Java-based apps, I have enjoyed the freedom to be able to use every hardware core available from a single process, all sharing the same heap. I have also suffered the misery of having to deal with interleaving IO, synchronization and avoiding blocking over shared resources. Java is unlike the other languages in this respect, since it gives the programmer both the tools and the responsibility to make best use of the hardware platform. Often that tempts the programmer into thinking they can eliminate all blocking IO by eliminating all non-core-memory IO. The reality is somewhat different, as no application that scales and connects humans together will ever have few enough connections between data to localise all the data used in a request to a single board of RAM. In my MPP years this was the domain decomposition bandwidth. It may be possible to eliminate IO from disk, but I have to doubt that a non-trivial application can eliminate all backend network IO. In a sense, the threading model of Java tempts the developer to try to implement efficient hardware resource utilization, but doesn't help them in doing so. The same can be said for many of the lower-level compiled languages. Fast and dangerous.

Don’t forget, with web applications, it’s IO that matters.