Languages and Threading models

17 05 2012

Since I emerged from the dark world of Java, where anything is possible, I have been missing the freedom to do whatever I wanted with threads to exploit as many cores as are available. With a certain level of nervousness I have been reading commentary on most of the major languages surrounding their threading models and how they make it easy or hard to utilize or waste hardware resources. Every article I read sits on a scale somewhere between absolute truth and utter FUD. The articles towards the FUD end of the scale always seem to cite benchmarks created by the author of the winning platform, so are easy to spot. This post is not about which language is better or which app server is the coolest thing; it's a note to myself on what I have learned, in the hope that if I have read too much FUD, someone will save me.

To the chase: I have looked at Java and Python, touched on Ruby, and thought about serving pages in event based and thread based modes. I am only considering web applications serving large numbers of users, and not thinking about compute intensive, massively parallel or GUI apps. Unless you are lucky enough to be able to fit all your data into memory, or even shard the memory over a wide scale cluster, the web application will become IO bound. Even if you have managed to fit all data into core memory you will still be IO bound on output, as core memory and CPU bandwidth will forever exceed that of networks, and 99% of webapps are not CPU intensive. If it were not that way, the MPP code I was working on in 1992 would have been truly massively parallel, and would have found a cure for cancer the following year. How well a language performs as the foundation of a web application comes down to how well that language manages the latencies introduced by non-core IO, not how efficiently it optimises inner loops. I am warming to the opinion that all languages and most web application frameworks are created equal in this respect, and it is only in the presentation of what they do that there is differentiation. An example: a Python based server running in threaded mode compared to Node.js.

Some background. Node.js uses Chrome's V8 JavaScript engine, which observes patterns in running JavaScript and compiles hot code into native machine code. It runs as a single thread inside a process on one core, delivering events to code that performs work exclusively until it releases control back to the core event dispatch, normally by returning from the event handling code. The core of Node.js generally uses an efficient event dispatch mechanism built into the OS (epoll, kqueue etc). There is no internal threading within a Node.js process, and to use multicore hardware you must fork separate OS level processes which communicate over lightweight channels. Node.js gets its speed from ensuring that the single thread is never blocked by IO: the moment IO would block, the single thread in Node.js moves on to performing some other useful work. Being single threaded it never has to think about inter-thread locking. That is my understanding of Node.js.
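
To make that concrete, here is a minimal sketch, in Python rather than Node's actual C/libuv internals, of the single threaded dispatch shape described above; the port and the trivial handler are invented for illustration.

import select
import socket

# One thread, one OS readiness poll (select here; epoll/kqueue in practice),
# and handlers that must return quickly so the loop can service other sockets.
server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 8080))
server.listen(128)
server.setblocking(False)

sockets = {server.fileno(): server}

while True:
    # The only place this thread blocks is in the OS poll call.
    readable, _, _ = select.select(list(sockets.values()), [], [])
    for sock in readable:
        if sock is server:
            conn, _ = server.accept()
            conn.setblocking(False)
            sockets[conn.fileno()] = conn
        else:
            data = sock.recv(4096)
            if not data:
                del sockets[sock.fileno()]
                sock.close()
                continue
            # Handler work happens here; anything slow or blocking at this
            # point stalls every other connection, which is the Node.js rule.
            sock.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")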

Python (and Ruby to some extent), when running as a single process, allows the user to create threads. By default these are OS level threads (pthreads), although there are other models available. I am talking only about pthreads here, which don't require programmer intervention. Due to the nature of the Python interpreter there is a global interpreter lock (the GIL) that only allows one Python thread to use the interpreter at a time. Threads are allowed to use the interpreter for a set interval, after which they are rescheduled. Even if you run a Python process on a multicore system, my understanding is that only one thread per process will execute at a time. When a thread enters blocking IO it releases the lock, allowing other threads to execute. Like Node.js, to make full use of multicore hardware you must run more than one Python process. Unlike Node.js, the internal implementation of the interpreter, and not the programming style, ensures that the CPU running the Python process switches between threads so that it is always performing useful work. In fact that's not quite true, since the IO libraries in Node.js have to relinquish control back to the main event loop to ensure they do not block.
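
As a small illustration of that release-on-blocking behaviour, the sketch below (with a placeholder URL, and not intended as a benchmark) fires ten fetches from ten pthreads; because the GIL is dropped around the blocking socket reads, the total time is roughly that of the slowest request rather than the sum of all of them.

import threading
import time
import urllib.request

URLS = ["http://example.com/"] * 10  # placeholder URL, repeated ten times

def fetch(url):
    # The GIL is released while this call blocks on network IO.
    urllib.request.urlopen(url).read()

start = time.time()
threads = [threading.Thread(target=fetch, args=(u,)) for u in URLS]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("elapsed %.2fs for %d requests" % (time.time() - start, len(URLS)))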

So, provided the mechanism for delivering work to the process is event based, there is little difference in the potential for Ruby, Python or Node.js to utilize hardware resources effectively. They all need one process per hardware core, as sketched below. Where they differ is in how the programmer ensures that control is released on blocking. With Python (and Ruby IIUC), control is released by the core interpreter without the programmer even knowing it is happening. With Node.js, control is released by the programmer invoking a function that explicitly passes control back. The only thing a Python programmer has to ensure is that there are sufficient threads in the process for the GIL to pass control to when IO latencies are encountered, and that depends on the deployment mechanism, which should be multi-threaded. The only added complication for the Node.js model is that the IO drivers need to ensure that every subsystem that performs blocking IO has some mechanism of storing state not bound to a thread (since there is only one). A database transaction for one request must not interact with that for another. This is no mean feat, and I will guess (not having looked) it is similar to the context switching process between native OS level threads. The only thing you can't do in Node.js is perform a compute intensive task without releasing control back to the event loop. Doing that stops a Node.js process from serving any other requests. If you do that in Python, the interpreter suspends the pthread and reschedules after a set number of instructions. Proof, in some sense, that multitasking is a foundation of the language rather than an artifact of the programmer's code base.
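
Stepping back to the deployment side, the sketch below shows the one-process-per-core model using Python's multiprocessing; serve_forever is a stand-in for whatever event loop or threaded worker each process would actually run.

import multiprocessing

def serve_forever(worker_id):
    # Each process has its own interpreter and its own GIL, so the workers
    # genuinely run in parallel across cores; a real worker would start an
    # event loop or a threaded WSGI server here.
    print("worker %d serving" % worker_id)

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=serve_forever, args=(i,))
               for i in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()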

The third language I mentioned is Java. Having spent most of the last 16 years coding Java based apps, I have enjoyed the freedom to be able to use every hardware core available from a single process, all sharing the same heap. I have also suffered the misery of having to deal with interleaving IO, synchronization and avoiding blocking over shared resources. Java is unlike the other languages in this respect since it gives the programmer the tools and the responsibility to make best use of the hardware platform. Often that tempts the programmer to think they can be successful in eliminating all blocking IO by eliminating all non-core-memory IO. The reality is somewhat different, as no application that scales and connects humans together will ever have few enough connections between data to localise all the data used in a request to a single board of RAM. From my MPP years this was the domain decomposition bandwidth. It may be possible to eliminate IO from disk, but I have to doubt that a non-trivial application can eliminate all backend network IO. In a sense, the threading model of Java tempts the developer to try and implement efficient hardware resource utilization, but doesn't help them in doing so. The same can be said for many of the lower level compiled languages. Fast and dangerous.

Don’t forget, with web applications, it’s IO that matters.

 





The trouble with Time Machine

9 05 2012

Every now and again Time Machine will spit out a “Can't perform backup, you must re-create your backup from scratch” or “Can't attach backup”. For anyone who was relying on its roll-back-time feature this is a reasonably depressing message, and it does typify modern operating systems, especially those of the closed source variety. At some point, having spent all the budget on pretty user interfaces and catered for all the use cases, the deadline driven environment decides, “Aw, stuff it, we will just pop up a catch-all ‘you're stuffed, mate’ dialog box”. 99% of users rant and rave and delete their backup, starting again with a sense of injustice. If you're reading this and have little or no technical knowledge, that's what you should do now.

If you get down to bare nuts and bolts you will find that a Time Machine backup is not that dissimilar to a BackupPC backup of 10 years ago. It makes extensive use of hard links to snapshot the state of the disk. It performs this in folders with thousands of files, creating a uniformly distributed tree. That all works fine, except when it doesn't. Anyone who has used hard links in anger on a file system will know it tends to put the file system under a lot of stress, resulting in more filesystem corruptions than normal. File systems are not that transactional, so if an operation fails part way through, the hard links may start to generate orphaned links.

Now Time Machine runs fsck_hfs when it attaches the sparse bundle file system which is the Time Machine backup. Unfortunately it doesn't try that hard to fix any problems it finds, and couldn't possibly corrupt its pretty UI by telling the user that it might have a problem with the user's cherished backup of life's memories. Not good for marketing, losing your loyal customers' photos when you promised them it wouldn't happen. Fortunately, those messages are logged in /var/log/fsck_hfs.log. If you use Time Machine and are finding the attach stage takes forever, take a look in there for the words “FILESYSTEM DIRTY”. That indicates that the last time Time Machine tried to attach the drive, the file system check was unable to check the file system and correct any errors, and so it marked it DIRTY. It is possible to correct one of these filesystems; however, with all those hard links the likelihood is that even if fsck_hfs -dryf /dev/diskXs1 does correct the errors and put the filesystem into a FILESYSTEM CLEAN state, it won't be a usable and valid backup. When your laptop exits your house with a man wearing a stripy jumper and tights over his head, your children (and you) will cry, realising that the backup in the cupboard is corrupt.

What advice can I give you?

  1. Check your backups regularly
  2. If you use TimeMachine, open the “Console” program, type DIRTY into the search box and if you find that word, go out and buy another backup disk…. quick. (A quick script that does the same check is sketched below.)
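
If you would rather do the check from a terminal, something like this little Python sketch (reading the log path mentioned above) will do the same job:

# Scan the fsck_hfs log for the DIRTY marker; run this on the machine that
# performs the Time Machine backups.
with open("/var/log/fsck_hfs.log") as log:
    dirty = [line.strip() for line in log if "DIRTY" in line]
if dirty:
    print("The backup volume has been marked dirty %d times, most recently:" % len(dirty))
    print(dirty[-1])
else:
    print("No DIRTY markers found.")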

For those that want to try and recover a Time Machine backup.

chflags -R nouchg /Volumes/My\ Time\ Capsule/mylaptop.sparsebundle
hdiutil attach -nomount -noverify -verbose -noautofsck /Volumes/My\ Time\ Capsule/mylaptop.sparsebundle
tail -f /var/log/fsck_hfs.log
# If you see  "The Volume could not be repaired"
# then you need to run
fsck_hfs -dryf /dev/rdiskXs2
# where X is the number of the disk listed when you ran hdiutil attach.
# I can almost guarantee that the disk will not be recoverable and you will see tens of thousands
# of broken hard link chains. Fixing those will probably corrupt the backup.
# which is why this is futile.

If you are using a Time Capsule, power cycle it first, connect your machine to it over 1000BaseT and make sure no other machines are accessing it. Don't use Wifi unless you want to grow old and die before the process completes.

 

Update

Perhaps I am being a little unfair here. The same unreliability could happen with any backup mechanism that is vulnerable to corrupted backups as a result of the user shutting the lid, the computer going to sleep or a power failure. Time Machine and Time Capsule's weakness is that it's all too easy to disconnect the network hard disk image, and once you do that the Time Capsule end has no way of shutting down the backup process in a safe way. Do that enough times (I have found one is enough) and the backup is corrupt and unrecoverable, beyond even what the HFS+ journal can repair.

I was also a bit unfair on BackupPC, which is initiated from the server and so, although it may create nightmare file systems, can leave the backup image in a reasonable state when the server loses sight of the client.

Time Machine on an attached drive appears more reliable, but a lot less useful.





PyOAE renamed DjOAE

2 05 2012

I’ve been talking to several folks since my last post on PyOAE and it has become clear that the name doesn’t convey the right message. The questions often center around the production usage of a native Python webapp or the complexity of writing your own framework from scratch. To address this issue I have renamed PyOAE to DjOAE to reflect its true nature.

It is a Django web application, and the reason I chose Django was because I didn't want to write yet another framework. I could have chosen any framework, even a Java framework if such a thing existed, but I chose Django because it has good production experience with some large sites, a vibrant community, and has already solved most of the problems that a framework should have solved.

The latest addition to that set of already-solved problems that I have needed is data and schema migration. DjOAE is intended to be deployed in a DevOps-like way, with hourly deployments if needed. To make that viable the code base has to address schema and data migrations as they happen. I have started to use South, which not only provides a framework for doing this, but automates roll forward and roll back of database schema and data (if possible). For the deployer the command is ever so simple.

python manage.py migrate

This queries the database to work out where it is relative to the code and then upgrades it to match the code.
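
For anyone who hasn't seen South, a migration is just a Python module with forwards and backwards methods that South normally generates for you; the sketch below is illustrative only, with an invented app and column, and is not part of DjOAE.

from south.db import db
from south.v2 import SchemaMigration
from django.db import models

class Migration(SchemaMigration):

    def forwards(self, orm):
        # Roll forward: add a nullable column so existing rows remain valid.
        db.add_column('djoae_content_contentitem', 'mime_type',
                      models.CharField(max_length=64, null=True),
                      keep_default=False)

    def backwards(self, orm):
        # Roll back: drop the column again.
        db.delete_column('djoae_content_contentitem', 'mime_type')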

This formalizes the process that has been used for years in Sakai CLE into a third party component used by thousands and avoids the nightmare scenario where all data migration has to be worked out when a release is performed.

I have to apologise to anyone upstream for the name change as it will cause some disruption, but better now than later. Fortunately clones are simple to adjust, as git seems to only care about the commit SHA1, so a simple edit to .git/config changing

url = ssh://git@bitbucket.org/ieb/pyoae.git
to
url = ssh://git@bitbucket.org/ieb/djoae.git

should be enough.

If you are using the standard settings you will need to rename your database. I did this with pgAdminIII without dropping the database.





Vivo Harvester for Symplectic Elements

25 04 2012

I've been doing some work recently on a Vivo Harvester for Symplectic Elements. Vivo, as its website says, aims to build a global collaboration network for researchers. They say National, I think Global. The project hopes to do this by collecting everything known about a researcher, including the links between researchers, and publishing that to the web. You could say it's like Facebook or G+ profiles for academics, except that the data is derived from reliable sources. Humans are notoriously unreliable when talking about themselves, and academics may be the worst offenders. That means that much of the information within a Vivo profile, and the Academic Graph that the underlying semantic web represents, has been reviewed. Either because the links between individuals have been validated by co-authored research in peer reviewed journals, or because the source of the information has been checked and validated.

If I compare the Academic Graph(tm) (just claimed tm over that if no one else has) with Social Graphs like OpenSocial, then I suspect that OpenSocial covers less than 10% of the ontology of an Academic Graph, and certainly a quick look at the base Vivo ontology reveals a few hundred top level object classes, each with tens of data or object properties. That creates some challenges for Vivo that have influenced its design. Unlike Twitter's FlockDB, with one property for each relationship, “follow”, this is a fully linked data set. Again unlike Twitter, there is no intention that a single Vivo instance, even cloud deployed, could host all researchers on a global scale. It seems that one of the standard RDF stores (Jena+SDB) is capable of at least holding the data and driving the user interface. I say holding, as since the early versions of Vivo it has used a Solr/Lucene index to provide query performance, since pure Jena queries would never keep pace with all the queries required by a live application. This introduces a secondary characteristic of Vivo: the update cycle. Updates to Vivo, especially via a harvester, require batched index re-builds and may require re-computation of the RDF inferences. That places Vivo firmly in the infrequent update space, which is precisely what Solr was designed for. A production instance of Vivo uses an RDF store for URI resource reference into a rich Academic Graph, of which some is exposed through the UI. That RDF store populates views in Solr from which much of the UI is derived. Solr becomes a rich data index.

A Vivo Harvester, of which the Symplectic Elements harvester is just one of many, harvests data from a trusted source and generates triples representing the data and its relationships. Some harvesters, like the PubMed harvester, do that in their own ontology, whereas other harvesters use the Vivo ontology. The process of harvesting is to crawl the APIs of the target source, recovering all information by following links. In the case of Symplectic Elements its API is an ATOM API, so parts of the Symplectic Harvester could be adapted to any ATOM based feed. The harvested information is converted into statements about each resource to represent knowledge: X is a person, Y is a peer reviewed publication, Z is a conference and A is a grant. Finally the huge bucket of statements is processed to identify statements about the same thing and compared to the existing Vivo model of knowledge. Eventually, when all the links have been matched and duplicates disambiguated, the data can be ingested into the Vivo model.
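
To give a flavour of what those statements look like, here is an illustrative Python/rdflib sketch; the URIs are invented and the class and property names are placeholders rather than the exact VIVO ontology terms, so treat it as shape, not content.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

# Placeholder namespaces and terms; the class/property names below are
# illustrative, not necessarily the exact terms used by the VIVO ontology.
VIVO = Namespace("http://vivoweb.org/ontology/core#")
EX = Namespace("http://example.org/individual/")

g = Graph()
g.bind("vivo", VIVO)

person = EX["person1234"]
paper = EX["pub5678"]

g.add((person, RDF.type, VIVO["FacultyMember"]))
g.add((person, RDFS.label, Literal("A. Researcher")))
g.add((paper, RDF.type, VIVO["AcademicArticle"]))
g.add((paper, RDFS.label, Literal("A peer reviewed publication")))
# The link between researcher and publication is itself just another statement.
g.add((person, VIVO["authorOf"], paper))

print(g.serialize(format="turtle"))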

With an actively managed source of information like Symplectic Elements this is easier than it sounds, since much of the disambiguation has already been done by staff at the University as part of their Research Assessment Exercise (RAE); however, the graph of knowledge may still contain loose ends, like external voluntary organisations known only by their title, hopefully not spelt in 101 different ways. Obviously Vivo can be fed by any trusted data source, including manual input.

What is the net result of this? The University running Symplectic Elements for its internal administrative processes (the RAE in the UK) is able, using open source software (Vivo and the Symplectic Harvester), to publish a public view of its own Academic Graph. If, as is the aim of the Vivo project and the reason for its original grant, a sufficient number of universities and research institutions deploy public instances, then Semantic Web aggregators will be able to index the linked data within each Vivo instance to build a global collaborative network of researchers, their interests, their funding streams and their discoveries. When Google acquired Freebase and the developer knowledge behind that company, they acquired sufficient knowledge to do this on a global scale. I have heard rumours that this is what those individuals have been up to, which is why they went quiet.

That was why the Semantic Web was created for researchers at CERN all those years ago, wasn't it?





PyOAE

10 04 2012

For those that have been watching my G+ feed, you will have noticed some videos being posted. Those videos are of the OAE 1.2 UI running on a server developed in Python using Django. I am getting an increasing stream of questions about what it is, hence this blog post.

What is PyOAE?

PyOAE is a re-implementation of the OAE server using Django and a fully relational database schema. I use PostgreSQL and/or SQLite3, but it would probably work on any RDBMS supported by Django. The implementation uses the OAE 1.2.0 UI code (the yet to be released 1.2.0 branch) as its specification and runs that UI unmodified.

Is it a port of Nakamura to Python?

It is not a port of Nakamura, the Java code base that is the official server of OAE, and it shares no common code or concepts with it. It does not contain the SparseMap content system or a Python port of SparseMap.

When will it be released ?

It is not yet functionally complete and so is not ready for release. When it is, I will release it.

Will it scale ?

The short answer is I don't have enough evidence to know at this stage. I have some circumstantial evidence and some hard evidence that suggests there is no reason why a fully relational model using Django might not scale.

Circumstantial: Django follows the same architectural model used by many LAMP based applications. Those have been shown to be amenable to scaling: WordPress, Wikipedia, etc. In the educational space, MoodleRooms scaled Moodle to 1M concurrent users with the help of MySQL and others.

Circumstantial: Sakai CLE uses relational storage and scales to the levels required to support the institutions that want to use it.

Circumstantial: Django has been used for some large, high traffic websites.

Hard: I have loaded the database with 1M users and the response time shows no increase with respect to the number of users.

Hard: I have loaded the database with 50K users and 300K messages with no sign of increasing response time.

Hard: I have loaded the content store with 100K users and 200K content items and seen no increase in response time.

However, I have not load tested with thousands of concurrent users.

Will it support multi tenancy ?

Short answer is yes, in the same way that WordPress supports multi tenancy, but with the potential to support it in other ways.

Is it complex to deploy ?

No, it's simple and uses any of the deployment mechanisms recommended by Django.

Does it use Solr?

At the moment it does not, although I haven't implemented free text searching on the bodies of uploaded content. All queries are relational SQL generated by Django's ORM framework.

What areas does it cover?

Users, Groups, Content, Messaging, Activities and Connections are currently implemented and tested. Content Authoring is partially supported and I have yet to look at supporting World. It's about 7K lines of Python including comments, covering about 160 REST endpoints, with all UI content being served directly from disk. There are about 40 tables in the database.

What areas have you made different ?

Other than the obvious use of an RDBMS for storage, there are two major differences that jump out. The content system is single instance, i.e. if you upload the same content item 100 times, it is only stored once and referenced 100 times. The second is that usernames are not visible in any HTTP response from the server, making it impossible to harvest usernames from an OAE instance. This version uses FERPA-safe opaque IDs. Due to some hard coded parts of the UI I have not been able to do the same for group names, which can be harvested.
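
As a purely hypothetical sketch of how those two differences might be modelled in Django (this is not DjOAE's actual schema, and every name below is invented):

import hashlib
import uuid

from django.db import models

def new_opaque_id():
    # FERPA-safe identifier exposed in URLs instead of the username.
    return uuid.uuid4().hex

class ContentBody(models.Model):
    # One row per unique body: uploading the same bytes 100 times stores
    # them once; ContentItem rows simply reference it.
    sha1 = models.CharField(max_length=40, unique=True)
    size = models.BigIntegerField()
    path = models.CharField(max_length=255)

class ContentItem(models.Model):
    opaque_id = models.CharField(max_length=32, unique=True, default=new_opaque_id)
    owner = models.ForeignKey('auth.User', on_delete=models.CASCADE)
    body = models.ForeignKey(ContentBody, on_delete=models.CASCADE)
    filename = models.CharField(max_length=255)

def save_to_disk(data):
    # Hypothetical body store writing to a content-addressed path.
    name = hashlib.sha1(data).hexdigest()
    with open('/var/djoae/bodies/' + name, 'wb') as f:
        f.write(data)
    return name

def store(owner, filename, data):
    body, _ = ContentBody.objects.get_or_create(
        sha1=hashlib.sha1(data).hexdigest(),
        defaults={'size': len(data), 'path': save_to_disk(data)})
    return ContentItem.objects.create(owner=owner, body=body, filename=filename)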

Could a NoSQL store be used ?

If Django has an ORM adapter for the NoSQL store, then yes, although I haven't tried, and I am not convinced it's necessary for the volume of records I am seeing.

Could the same be done in Java?

Probably, but the pace of development would be considerably slower (Django development follows an edit-save-refresh pattern and requires far fewer lines of code than Java). Also, a mature framework that had solved all the problems Django has addressed would be needed to avoid the curse of all Java developers: the temptation to write their own framework.

Who is developing it?

I have been developing it in my spare time over the past 6 weeks. It represents about 10 days' work so far.

Why did you start doing it?

In January 2012 I heard that several people whose opinion I respected had said Nakamura should be scrapped and re-written in favour of a fully relational model. I wanted to find out if they were correct. So far I have found nothing that says they are wrong. My only regret is that I didn't hear them earlier, and I wish I had tried this in October 2010.

Why Django?

I didn't want to write yet another framework, and I don't have enough free time to do this in Java. Django has proved to have everything I need, and Python has proved to be quick to develop in and plenty fast enough to run OAE.

Didn’t you try GAE ?

I did try writing a Google App Engine backend for OAE, however I soon found two problems. The UI requirements for counting and querying relating to Groups were incompatible with BigTable. I also calculated that a GAE hosted instance would rapidly breach the free GAE hosting limits due to the volume of UI requests. Those two factors combined to cause me to abandon a GAE based backend.





Flashback

6 04 2012

The world wakes up to an OSX virus. News media jump on the story, terrifying users that they might be infected. Even though the malware that users were tricked into installing may not be nice, it's clear from looking at the removal procedure that, unlike the Windows platform where a virus normally buries itself deep within the inner workings of the OS, this trojan simply modified an XML file on disk and hence reveals its location. To be successful in doing that it would have had to persuade the user to give it elevated privileges, as the file is only writable by the root user. If it failed to do that it would only have infected user space.

In spite of all the hype around this infection, the root of infection shows that the underlying OS is, in itself, secure, and so only as secure as the user who grants an installer elevated privileges. If, when you install software on your Mac, you are not prompted for an administrative password, go and find out why before something else quietly installs itself and steals your bank details.

The files involved are the plist files (Info.plist) for any browser, so when you look to see if you have been infected, don't forget to check all the browsers you use, not just Safari and Firefox. Also check Chrome.

If you are wondering whether other plists are secure, many are cryptographically signed with a private key belonging to Apple. Provided that key doesn't leak undetected, those plists can't easily be compromised. For anyone who is paranoid, standard Unix tools like Tripwire would protect any unsigned plists.





Modern WebApps

12 03 2012

Modern web apps, like it or not, are going to make use of things like WebSockets. Browser support is already present, and UX designers will start requiring that UI implementations get data from the server in real time. Polling is not a viable solution for real deployment since, at a network level, it will cause the endless transfer of useless data to and from the server. Each request asks every time, “what happened?” and the server dutifully responds “like I said last time, nothing”. Even with minimal response sizes, every request comes with headers that will eat network capacity. Moving away from the polling model will be easy for UI developers working mostly in the client and creating tempting UIs for a handful of users. Those attractive UIs generate demand, and soon the handful of users become hundreds or thousands. In the past we were able to simply scale up the web server, turn keep-alives off, distribute static content and tune the hell out of each critical request. As WebSockets become more widespread, that won't be possible. The problem here is that web servers have been built for the last 10 years on a thread per request model, and many supporting libraries share that assumption. In the polling world that's fine, since the request gets bound to the thread, the response is generated as fast as possible, and the thread is unbound. Provided the response time is low enough, the request throughput of the server will be maintained high enough to service all requests without exhausting the OS's ability to manage threads/processes/open resources.

Serving a WebSocket request with the same model is a problem. The request is bound to a thread, but the response is not generated, as the request waits, mid-request, pending some external event. Some time later, that event happens and the response is delivered back to the client. The traditional web server environment will have to expect to support as many concurrent requests on your infrastructure as there are users who have a page pointing to your server in one of the many tabs they have open. If you have 100K users with a browser window open on a page where you have a WebSocket connection, then the hosting infrastructure will need to support 100K in-progress requests. If the webserver model is process per request, somehow you have to provide resources to support 100K OS level processes. If it's thread per request, then 100K threads. Obviously the only way of supporting this level of idle but connected requests is to use an event processing model. But that creates problems.

For instance, anyone writing PHP code will know it will probably only run in process-per-worker mode, as many of the PHP extensions are not thread safe. Java servlets are similar: although changes in the Servlet 3 spec have constructs to release the processing thread back to the container, many applications are still being developed on Servlet 2.4 and 2.5, and most frameworks are not capable of suspending requests. Python using mod_wsgi doesn't have a well defined way of releasing the processing thread back to the server, although there is some code originating from Google that uses mod_python to manipulate the connection and release the thread back to Apache Httpd.

There are new frameworks (e.g. Node.js) that address this problem, and there is a considerable amount of religion surrounding their use. The believers are able to show unbelievable performance levels on benchmark test cases, and the doubters are able to point to unbelievably complex and unfathomable real application code. There are plenty of other approaches to the same problem that avoid spaghetti code, but the fundamental message is that to support WebSockets at the server side an event based processing model has to be used. That is the direct opposite of how web applications have been delivered to date, and regardless of the religion, that creates a problem for deployment.

Deployment of this type of application demands that WebSocket connections can be unbound from the thread servicing the request when it becomes a WebSocket connection. The nasty twist is that every box handling the request needs to be able to do that, including any WebTiers or load balancers, and any HTTP connection can be converted from the HTTP protocol into the WebSocket protocol during the request. Fortunately, sensible applications will only support WebSocket on known URLs, which gives the LB and WebTiers an opportunity to route, but prior to routing every component in the chain must be using a small number of threads servicing a large number of open and active sockets.

This doesn't mean that an entire application framework must be thrown away, but it does mean that whatever is handling the WebSocket request, upgrade and eventual connection must be event based. It also doesn't mean that everyone must learn how to read and write spaghetti code, managing every aspect of threading and concurrency in communication and re-writing every library to be non-blocking and asynchronous. Fortunately there are some extremely capable epoll based containers (including Node.js, other than its insistence on using JS) that can be used either as WebTier proxies or as ultimate endpoints. Some of them, such as the Python based Tornado server, have frameworks supporting the WSGI standard and hence are capable of running Django based applications for the non-WebSocket portion. As can be seen from real benchmarks, these servers offer the performance levels expected of event based processing alongside support for traditional frameworks with real blocking resource connections.
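
A minimal sketch of that arrangement with Tornado, assuming the WebSocket lives on a known URL and everything else falls through to an ordinary WSGI application (the trivial WSGI app below is a placeholder standing in for something like a Django application):

import tornado.ioloop
import tornado.web
import tornado.websocket
import tornado.wsgi

class EventsSocket(tornado.websocket.WebSocketHandler):
    # The connection stays open without pinning a request thread; the IOLoop
    # holds it alongside every other idle socket.
    def open(self):
        self.write_message("connected")

    def on_message(self, message):
        self.write_message("echo: " + message)

def wsgi_app(environ, start_response):
    # Stand-in for the traditional, blocking framework portion.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from the WSGI side"]

application = tornado.web.Application([
    (r"/events", EventsSocket),  # the known WebSocket URL, routable at the LB
    (r".*", tornado.web.FallbackHandler,
     dict(fallback=tornado.wsgi.WSGIContainer(wsgi_app))),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()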





Deploying/Tuning SparseMap Correctly: Update

1 03 2012

In the last post I reported the impact of using SparseMap with caching disabled, but at the same time noticed that there was an error in the way in which JDBC connections were handled. SparseMap doesn't pool JDBC connections. It makes the assumption that anyone using SparseMap will read the Session API and note that Sessions should only be used on one thread during their active lifetime. Although the SparseMap Session implementation is thread safe, doing that eliminates all blocking synchronization and concurrent locks. If the Session API usage is followed, then JDBC connections can be taken out of the pool and bound to threads, which ensures that only one thread will ever access a JDBC connection at any one time. When bound to threads, JDBC sessions can be long lived and so are only revalidated if they have been idle for some time. If any connection is found to be in an error state it is replaced with a fresh connection.
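
SparseMap is Java, but the pattern is easy to see in a few lines of Python: one long lived connection per thread, revalidated only after it has been idle, replaced if broken, and never shared so never locked. This is an illustration of the pattern only, not SparseMap's code, and the threshold and datasource are invented.

import sqlite3
import threading
import time

IDLE_REVALIDATE_SECONDS = 60  # assumed threshold; the real value may differ
_local = threading.local()

def _open():
    return sqlite3.connect("app.db")  # placeholder datasource

def get_connection():
    conn = getattr(_local, "conn", None)
    last_used = getattr(_local, "last_used", 0)
    if conn is not None and time.time() - last_used > IDLE_REVALIDATE_SECONDS:
        try:
            conn.execute("select 1")  # cheap validation query
        except sqlite3.Error:
            conn.close()
            conn = None
    if conn is None:
        conn = _open()
    _local.conn = conn
    _local.last_used = time.time()
    return conn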

In Sakai OAE 1.2, with SparseMap caching inadvertently disabled and JDBC connections being validated on every request, there were about 500,000 SQL operations in the short load test. With those issues addressed, the number of SQL operations drops to 6,600, removing almost 1,000s (over 15 minutes) from the 1 hour load test execution time and removing JDBC entirely from the list of JVM hot spots. Notice that in the last hotspot report SparseMap is showing almost no time being blocked by sync operations, although there is more time spent suspended than I would like, which needs to be investigated. I cannot stress how important it is to make certain that caching is working properly if you are using SparseMap. Here are the results, which I can't take any credit for. The load testing was performed by Jon Cook at Indiana University; full details can be found at https://uisapp2.iu.edu/confluence-prd/display/ONC/OAE+Evaluation+Load+Test+Results

SQL Profile Before Caching was enabled correctly.

SQL Profile After Caching was enabled correctly

JVM Hotspots before Caching was enabled correctly





Deploying/Tuning SparseMap Correctly

26 02 2012

Deploying

SparseMap is designed to maintain a shared cache of recently accessed maps in memory. The code base itself is also designed to use as little memory as possible; the SparseMap app server runs happily at load in 20MB of heap. Sakai OAE, which is the main user of SparseMap, uses a little more than that (around 200MB), leaving the remainder of the heap available for caching. If caching is working correctly, and there is sufficient heap available for the number of active users, the profile of calls to the storage layer should show almost no reads and a low level of writes. If, however, a mistake is made then the impact is dramatic. The first trace here shows Sakai OAE as of 2nd Feb 2012 running SparseMap 1.3 with a misconfigured cache setup. The image shows the SQL report from NeoLoad.

You can see that there are a colossal number of SQL statements performing a query on parent hash, and there is also a massive number of other queries. Obviously something is not right.

Compare that with Sakai OAE on 23 Feb 2012 running SparseMap 1.5 with caching configured

The query profile has completely changed, with almost everything being served from cache in this test. The 282,189 queries taking 577s for parenthash have become 325 queries taking 0.645s. The message here is: don't deploy SparseMap without caching enabled, and check that it is enabled and sized correctly. There are periodic log statements coming from SparseMap that indicate the performance of the cache, which should always be running at over an 80% hit rate.

Tuning

SparseMap comes with a default configuration for SQL and DDL. It may be perfectly OK for most installations, never needing any tuning, but the design and implementation assumed that deployers would tune both the DDL and the SQL.

Tuning DDL

The DDL that comes with the RDBMS drivers is a default SQL schema. It makes the assumption that the deployment is going to be small to medium in size and will probably never see more than 1M content items. If, after sizing a production deployment, it is clear that the application will contain more than 1M items then some tuning of this DDL must be done. How much depends on how big the installation will be. The internal structure of SparseMap was designed to use database shards in the same way that YouTube's metadata store does. The sharding is performed on the first 2 characters of the row key, giving a theoretical maximum number of shards of 64^2, although the configuration file will become unmanageable with that many shards.
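
As a toy illustration of the scheme (the mapping below is invented and is not SparseMap's configuration format), shard selection is no more than a lookup on the first two characters of the row key:

SHARDS = {
    "": "jdbc:default-shard",   # fallback shard
    "aa": "jdbc:shard-0",
    "ab": "jdbc:shard-1",
    # ... one entry per configured key prefix, up to 64^2 of them
}

def shard_for(row_key):
    return SHARDS.get(row_key[:2], SHARDS[""])

print(shard_for("abf39c0d"))  # -> jdbc:shard-1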

Even if sharding is not required, the indexing operations within SparseMap will need tuning. SparseMap should only be configured to index the column values that the application needs to index. By default there is only a single, very wide indexing table, which can become extremely inefficient. Its columns are by default large varchars, and for many situations this will be very slow and wasteful. Once the client application knows what its indexing tables need to look like, it should create those tables before SparseMap starts up, so that each table (yes, there are supposed to be more than one index table) is dense and efficient, with properties that are queried and indexed by the same use case living in the same table. If the application has chosen to use non-real-time querying (e.g. by using Solr), then it should ensure that SparseMap is not unnecessarily indexing data that it will never use.

Tuning SQL

One of the main requirements for SparseMap was that it would allow UI devs to store and query any unstructured data without having to write a line of Java code. Consequently the queries it generates are not the most efficient. The design always assumed that deployers would at some time need to tune the SQL the application was hitting the database with, and they would also want to do that without touching a line of Java code. All the SQL is in property files to allow deployers to tune in production.

To assist a deployer in tuning, users of SparseMap should name all their queries by setting the property “_statementset” on each query to a different name. If that is done, then the deployer can bind customised, tuned queries to each value of _statementset. The binding of queries also takes account of sharding if the storage layer has been sharded.

This only gives an introduction to deploying and tuning SparseMap. I would be astounded if SparseMap was so elegant that it could be deployed into production without ensuring caching was configured correctly, the DDL was adjusted and the SQL was tuned where appropriate.

For the observant amongst you, the NeoLoad SQL reports have revealed a rather obvious bug that needs attention. SQL connections are bound to threads, not to the pooled storage client implementation. When the storage client implementation is borrowed from the pool, the SQL connection associated with the thread performing the borrow operation is validated; in this case, on Oracle, select 1 from dual is executed. Since the connection is thread bound this is unnecessary and accounts for 496s of the 509s consumed by the load test. I intend to remove most of that in the next release, as the validation approach is incorrect and does not protect against SQL failure mid-request. Correctly configured caching did, thankfully, reduce the SQL portion of this load test from 1064s down to 509s, although I think it should be possible to reduce that to around 120s. At which point the load test will need to be upgraded.

 

 

Update: 29 Feb 2012

The high levels of select 1 from dual have been eliminated. Duffy Gillman from rSmart did the bulk of the work a few months ago in SparseMap 1.5; however, he/we/I failed to remove all the places where connections were verified unnecessarily. The issue was fixed with commit 778a0bfe97963dccf46566a431853bab6f7c87cc, which is available in the master branch of SparseMap for Sakai OAE to merge into their fork of the codebase. Hopefully the improvement will show up in the next round of load testing, and should also remove JDBC as the top hotspot.





OpenID HTML and HTMLYadis Discovery parser for OpenID4Java

24 02 2012

OpenID4Java is a great library for doing OpenID and OAuth. Step2 will probably be better, but it's not released yet. Unfortunately the HTML and HTML Yadis parsers rely on parsing the full HTML document and pull in a large number of libraries. These include things like Xerces and Resolver, which can cause problems if running multiple versions in the same JVM under OSGi. For anyone else wanting to eliminate dependencies, here are regex based parsers that have no dependencies outside core OpenID4Java and the JRE.

package uk.co.tfd.sm.authn.openid;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.openid4java.discovery.yadis.YadisException;
import org.openid4java.discovery.yadis.YadisHtmlParser;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HTMLYadisDiscoveryParser implements YadisHtmlParser {

    private static final Logger LOGGER = LoggerFactory.getLogger(HTMLYadisDiscoveryParser.class);
    @Override
    public String getHtmlMeta(String input) throws YadisException {
        Pattern head = Pattern.compile("\\<head.*?\\</head\\>",Pattern.CASE_INSENSITIVE | Pattern.DOTALL );
        Pattern meta = Pattern.compile("\\<meta.*?http-equiv=\"X-XRDS-Location\".*?\\>", Pattern.CASE_INSENSITIVE| Pattern.DOTALL);
        Pattern url = Pattern.compile("content=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Matcher headMatch = head.matcher(input);
        if ( headMatch.find() ) {
            Matcher metaMatcher = meta.matcher(headMatch.group());
            while( metaMatcher.find()) {                
                Matcher urlMatcher = url.matcher(metaMatcher.group());
                if ( urlMatcher.find() ) {
                    return urlMatcher.group(1);
                }
            } 
        } else {
            LOGGER.info("No head found in {} ", input);
        }
        return null;
    }
}

package uk.co.tfd.sm.authn.openid;

import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang.StringUtils;
import org.openid4java.OpenIDException;
import org.openid4java.discovery.DiscoveryException;
import org.openid4java.discovery.html.HtmlParser;
import org.openid4java.discovery.html.HtmlResult;
import com.google.common.collect.ImmutableSet;

public class HTMLDiscoveryParser implements HtmlParser {

    @Override
    public void parseHtml(String htmlData, HtmlResult result) throws DiscoveryException {
        Pattern head = Pattern.compile("\\<head.*?\\</head\\>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Pattern link = Pattern.compile("\\<link.*?\\>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Pattern linkRel = Pattern.compile("rel=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Pattern linkHref = Pattern.compile("href=\"(.*?)\"", Pattern.CASE_INSENSITIVE);
        Matcher headMatch = head.matcher(htmlData);
        if (headMatch.find()) {
            Matcher linkMatcher = link.matcher(headMatch.group());
            while (linkMatcher.find()) {
                String linkTag = linkMatcher.group();
                Matcher linkRelMatch = linkRel.matcher(linkTag);
                if (linkRelMatch.find()) {
                    Matcher linkHrefMatcher = linkHref.matcher(linkTag);
                    if (linkHrefMatcher.find()) {
                        String href = linkHrefMatcher.group(1);
                        Set<String> terms = ImmutableSet.copyOf(StringUtils.split(linkRelMatch.group(1), " "));
                        if (terms.contains("openid.server")) {
                            if (result.getOP1Endpoint() != null) {
                                throw new DiscoveryException("More than one openid.server entries found",
                                        OpenIDException.DISCOVERY_HTML_PARSE_ERROR);
                            }
                            result.setEndpoint1(href);
                        }
                        if (terms.contains("openid.delegate")) {
                            if (result.getDelegate1() != null) {
                                throw new DiscoveryException("More than one openid.delegate entries found",
                                        OpenIDException.DISCOVERY_HTML_PARSE_ERROR);
                            }
                            result.setDelegate1(href);
                        }
                        if (terms.contains("openid2.provider")) {
                            if (result.getOP2Endpoint() != null) {
                                throw new DiscoveryException("More than one openid2.provider entries found",
                                        OpenIDException.DISCOVERY_HTML_PARSE_ERROR);
                            }
                            result.setEndpoint2(href);
                        }
                        if (terms.contains("openid2.local_id")) {
                            if (result.getDelegate2() != null) {
                                throw new DiscoveryException("More than one openid2.local_id entries found",
                                        OpenIDException.DISCOVERY_HTML_PARSE_ERROR);
                            }
                            result.setDelegate2(href);
                        }
                    }
                }
            }
        }
    }
}