<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>T F D</title>
	<atom:link href="http://blog.tfd.co.uk/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.tfd.co.uk</link>
	<description>Open Source Open Thought</description>
	<lastBuildDate>Thu, 16 May 2013 08:27:20 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.tfd.co.uk' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/493b8cbeb34ea6b3296a64c26bce7e4a?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>T F D</title>
		<link>http://blog.tfd.co.uk</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.tfd.co.uk/osd.xml" title="T F D" />
	<atom:link rel='hub' href='http://blog.tfd.co.uk/?pushpress=hub'/>
		<item>
		<title>HowTo: Make your MacBookPro feel like new again.</title>
		<link>http://blog.tfd.co.uk/2013/03/01/howto-make-you-macbookpro-feel-like-new-again/</link>
		<comments>http://blog.tfd.co.uk/2013/03/01/howto-make-you-macbookpro-feel-like-new-again/#comments</comments>
		<pubDate>Fri, 01 Mar 2013 00:50:54 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=840</guid>
		<description><![CDATA[Most computers have inbuilt obsolescence that&#8217;s fundamental to the way they were created. When the chips were made, the semiconductor capability was implanted by doping areas of the silicon with atoms to change the electrical behaviour, often by diffusing those atoms at a higher than normal operating temperature. Once in service, over time the atoms continue to [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=840&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Most computers have inbuilt obsolescence that&#8217;s fundamental to the way they were created. When the chips were made, the semiconductor capability was implanted by doping areas of the silicon with atoms to change the electrical behaviour, often by diffusing those atoms at a higher than normal operating temperature. Once in service, over time the atoms continue to diffuse, and eventually they diffuse enough to cause the semiconductor to fail. The CPU stops functioning, or the memory chip becomes&#8230; oh what&#8217;s the word&#8230; forgetful.</p>
<p>Under normal conditions this takes a long time, and the older the chip the longer it takes. Older chips from the 1980s were built on a huge feature scale relative to today&#8217;s silicon, giving atoms extended journeys to complete before diffusion did its damage. Two years ago I replaced a marine autopilot, installed in 1984, with a failed 16MHz microcontroller the size of a football field. It had survived almost 20 years in a black box in the sun before the doping atoms completed their journey.</p>
<p>Sitting on the train with my legs being slowly roasted by a hot MacBook Pro I realised something wasn&#8217;t right. I was only reading a PDF with both CPUs at 1% and the fans whirling like dervishes, trying in vain to keep the CPU temp below 85C. My legs would recover, but I like my MBP, and even though it&#8217;s becoming old, its slow(er) processor and lack of RAM make me write faster code. I don&#8217;t really want those atoms to finish their hop, skip and a jump of a journey, ending the life of the CPU. At 85C I&#8217;ll bet they are hopping all over the place.</p>
<div id="attachment_842" class="wp-caption alignright" style="width: 122px"><a href="http://ianboston.files.wordpress.com/2013/03/photo-3.jpg"><img class="size-thumbnail wp-image-842 " alt="After" src="http://ianboston.files.wordpress.com/2013/03/photo-3.jpg?w=112&#038;h=150" width="112" height="150" /></a><p class="wp-caption-text">After</p></div>
<div id="attachment_843" class="wp-caption alignright" style="width: 122px"><a href="http://ianboston.files.wordpress.com/2013/03/photo-2.jpg"><img class="size-thumbnail wp-image-843 " alt="Before" src="http://ianboston.files.wordpress.com/2013/03/photo-2.jpg?w=112&#038;h=150" width="112" height="150" /></a><p class="wp-caption-text">Before</p></div>
<p>On opening the back I discovered the fan exhausts were clogged with fluff. After cleaning, the fans are hardly spinning and the CPU temperatures are well below 50C most of the time. Unintentionally, Apple have added inbuilt obsolescence to their laptops. As you use your MBP it will get hot. The fans will pull in dust even in the most sterile office and home environments, and they will eventually block. The silicon components will run hotter than they should, the dopant atoms will be hopping and finish their diffusion journey, and your digital life will be in the bin sooner than it should.</p>
<p>Having cleaned the fans, my MBP feels like when it was new and cool&#8230;. or am I just getting old and forgetful, where is my fan?</p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=840&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2013/03/01/howto-make-you-macbookpro-feel-like-new-again/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://ianboston.files.wordpress.com/2013/03/photo-3.jpg?w=112" medium="image">
			<media:title type="html">After</media:title>
		</media:content>

		<media:content url="http://ianboston.files.wordpress.com/2013/03/photo-2.jpg?w=112" medium="image">
			<media:title type="html">Before</media:title>
		</media:content>
	</item>
		<item>
		<title>Java, Vulnerabilities and FUD</title>
		<link>http://blog.tfd.co.uk/2013/01/13/java-vulnerabilities-and-fud/</link>
		<comments>http://blog.tfd.co.uk/2013/01/13/java-vulnerabilities-and-fud/#comments</comments>
		<pubDate>Sun, 13 Jan 2013 08:47:55 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Applet]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Java Applet]]></category>
		<category><![CDATA[Languages]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[Web browser]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=836</guid>
		<description><![CDATA[There are plenty of people in the IT industry that would like nothing better than for Java never to have existed. The current vulnerability is being swooped on by anyone with an agenda to feed the media with FUD. The media, not knowing any better, is dutifully reporting the information they are given. What are the facts? [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=836&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>There are plenty of people in the IT industry that would like nothing better than for Java never to have existed. The current vulnerability is being swooped on by anyone with an agenda to feed the media with FUD. The media, not knowing any better, is dutifully reporting the information they are given. What are the facts?</p>
<p>1. The vulnerability is only really significant for Java running inside a web browser using the unsigned Java Applet mechanism, which accounts for 0.1% of the usage of Java.</p>
<p>2. No one (since about 2001) should ever have support for <a class="zem_slink" title="Java applet" href="http://en.wikipedia.org/wiki/Java_applet" target="_blank" rel="wikipedia">Java Applets</a> turned on in a Browser, and no <a class="zem_slink" title="Web developer" href="http://en.wikipedia.org/wiki/Web_developer" target="_blank" rel="wikipedia">web developer</a> since should deploy a Java Applet.</p>
<p>Statements from &#8220;Security Experts&#8221; like &#8220;java security is a mess&#8221; are true only in the context of running Java in a <a class="zem_slink" title="Web browser" href="http://en.wikipedia.org/wiki/Web_browser" target="_blank" rel="wikipedia">Web Browser</a>, and should be qualified by: running any <a class="zem_slink" title="Machine code" href="http://en.wikipedia.org/wiki/Machine_code" target="_blank" rel="wikipedia">native code</a> downloaded from the internet on an operating system that has no real intrinsic security is suicidal. It&#8217;s not really Java that&#8217;s the problem; the problem is <a class="zem_slink" title="Operating system" href="http://en.wikipedia.org/wiki/Operating_system" target="_blank" rel="wikipedia">operating systems</a> that allow a web browser to run in an environment where it can make fundamental changes to the core aspects of the operating system. Just as you would never download and run an untrusted native executable as the <a class="zem_slink" title="Superuser" href="http://en.wikipedia.org/wiki/Superuser" target="_blank" rel="wikipedia">root user</a>, or even an &#8220;Administrator&#8221;, you only have yourself to blame if you click &#8220;Ok&#8221; when asked the question &#8220;Do you mind if I run this untrusted code &#8230;. that will steal your identity, empty your bank account, sell your house and destroy your life&#8221;. Don&#8217;t blame a language (any language), blame the browser or the OS or yourself.</p>
<p>Applets should have been deprecated in 2001. They were relevant when browsers were incapable, but have been superseded since. This vulnerability has nothing to do with 99.9% of Java usage. It&#8217;s a pity, but not surprising, that this message will never reach mainstream media. Long live FUD.</p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=836&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2013/01/13/java-vulnerabilities-and-fud/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>Scaling streaming from a threaded app server</title>
		<link>http://blog.tfd.co.uk/2013/01/03/scaling-streaming-from-a-threaded-app-server/</link>
		<comments>http://blog.tfd.co.uk/2013/01/03/scaling-streaming-from-a-threaded-app-server/#comments</comments>
		<pubDate>Thu, 03 Jan 2013 07:43:44 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache]]></category>
		<category><![CDATA[Apache HTTP Server]]></category>
		<category><![CDATA[Application server]]></category>
		<category><![CDATA[DSpace]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=833</guid>
		<description><![CDATA[One of the criticisms that is often leveled against threaded servers where a thread or process is bound to a request for the lifetime of that request, is that they don&#8217;t scale when presented with a classical web scalability problem. In many applications the criticism is justified, not because the architecture is at fault, but often because some fundamental rules of implementation have [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=833&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>One of the criticisms often leveled against threaded servers, where a thread or process is bound to a request for the lifetime of that request, is that they don&#8217;t scale when presented with a classical web scalability problem. In many applications the criticism is justified, not because the architecture is at fault, but often because some fundamental rules of implementation have been broken. Threaded servers are good at serving requests where the application thread is bound to the request for the shortest possible time and, while it is bound, no IO waits are encountered. If that rule is adhered to, then some necessities of reliable web applications are relatively trivial to achieve and the server will be capable of delivering throughput that saturates all the resources of the hardware. Unfortunately, all too often application developers break that rule and think the only solution is to use a much more complex environment that requires event based programming to interleave the IO wait states of thousands of in progress requests. In the process they dispose of transactions, since the storage system they were using (an RDBMS) can&#8217;t possibly manage several hundred thousand in progress transactions, even if there was sufficient memory on the app server to manage the resources associated with each request to transaction mapping&#8230;. unless they have an infinite hardware budget and there was no such thing as physics.</p>
<p>A typical situation where this happens is where large files are streamed to users over slow connections. The typical web application implementation spins up a thread that performs some queries to validate ACLs on the item, perhaps via SQL or via some in-memory structure. Once the request is validated, that thread, with all its baggage and resources, laboriously copies blocks of bytes out to the client while remaining associated with the request. The request to thread association is essentially long lived. If the connector managing the http connection knows about keep alives, it might release the thread to connection association at the end of the response, but it can&#8217;t do that until the response is complete. So a typical application serving large files to users will rapidly run out of spare threads, giving threaded servers a bad name. That&#8217;s bad in so many ways. Trickled responses can&#8217;t be cached, so they have to be regenerated every time. The application runs like a dog, because a tiny part of its behaviour is always a resource hog. Anyone deploying in production will find simple DoS attacks are easy to execute by just holding down the refresh button on a browser.</p>
<p>It doesn&#8217;t have to be like that. The time taken for the application to process the request and send the very first byte should be no greater than for any other request processed by the application. Most Java based applications can get that response time below 10ms, and responses below 1ms are not too hard on modern hardware with a well structured application. To do this with a streamed body is relatively simple. Validate the request, then generate a response header in the threaded application server that instructs the connector handling the front end http connection to deliver content from an internal location. Commit the response with no body, and detach the thread servicing the request from the request, freeing it to service the next request. Since, if implemented efficiently, there are hardly any IO waits involved in that operation, there is little time in which a thread or CPU core sits waiting for IO when it could be doing other processing.</p>
<p>If the bitstream to be sent is stored as a file, then you can use X-Sendfile, which originated in lighttpd, with similar implementations in Apache Httpd (<a href="https://tn123.org/mod_xsendfile/">mod_xsendfile</a>) and nginx (<a href="http://wiki.nginx.org/XSendfile">X-Accel-Redirect</a>). If the file is stored at a remote http location then some other delivery mechanism can be used. Obviously the http connector (any of the above) should be configured to handle a long lived connection delivering bytes slowly.</p>
<p>In the blog post prior to this I mentioned that <a class="zem_slink" title="DSpace" href="http://www.dspace.org/" target="_blank" rel="homepage">DSpace</a> 3 could be made to serve public content via a cache, exposing literally thousands of assets to slow download. I am using this approach to ensure that the back end DSpace server does not get involved with streaming content, which might be small PDFs but could just as well be multi GB video files or research datasets. The assets in DSpace have been stored on a mountable file system, allowing a front end http server to deliver the content without reference to the application server. I have used the following snippets to set and commit the response headers after ACLs have been processed. I also deliver such content via an HMAC secured redirect to ensure that user submitted content in the Digital Repository can&#8217;t maliciously steal administrative sessions. Generation of the HMAC secured redirect takes in the region of 50ms, during which time resources are dedicated. If the target is public, the redirect pointer may be cached. Conversion of the HMAC secured redirect into an X-Sendfile header takes in the region of 1ms with no requirement for database access. Serving the bitstream itself introduces IO waits, but the redirects can be sent to simple evented httpd servers in a farm. If all the app server is doing is processing the HMAC secured redirects, then a few hundred threads at 1ms per request can handle significant traffic in the app server layer. I&#8217;ll leave you to do the math.</p>
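<p>To make the HMAC secured redirect concrete, here is a minimal sketch of how such a token could be signed and checked. It is not the code used in the Digital Repository: the class name <code>HmacRedirect</code>, the <code>path|expiry</code> message layout and the choice of HmacSHA256 are all assumptions made for illustration, using only the JDK&#8217;s <code>javax.crypto.Mac</code>.</p>

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;

// Hypothetical helper: signs and verifies a short-lived redirect to an asset path.
final class HmacRedirect {
    private final byte[] secret;

    HmacRedirect(byte[] secret) {
        this.secret = secret.clone();
    }

    // Returns a URL-safe Base64 HMAC-SHA256 over "path|expiresEpochSeconds".
    String sign(String path, long expiresEpochSeconds) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            String message = path + "|" + expiresEpochSeconds;
            byte[] raw = mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    // True only if the signature matches and the redirect has not expired.
    boolean verify(String path, long expiresEpochSeconds, String signature, long nowEpochSeconds) {
        if (nowEpochSeconds > expiresEpochSeconds) {
            return false;
        }
        byte[] expected = sign(path, expiresEpochSeconds).getBytes(StandardCharsets.UTF_8);
        byte[] actual = signature.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, actual); // constant-time comparison
    }

    public static void main(String[] args) {
        HmacRedirect redirect = new HmacRedirect("change-me".getBytes(StandardCharsets.UTF_8));
        long now = 1_357_200_000L;          // illustrative clock value
        long expires = now + 60;            // redirect valid for one minute
        String sig = redirect.sign("/assets/12/34/file.pdf", expires);
        System.out.println(redirect.verify("/assets/12/34/file.pdf", expires, sig, now));
    }
}
```

<p>Verification needs only the shared secret and the values carried in the redirect, which is why that step can run without any database access.</p>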
<p>The same technique could be used for any long lived http request, eliminating the need to use an evented application server stack and abandon transactions. Obviously, if your application server code has become so complex that the non streaming requests are taking so long they limit throughput, then this isn&#8217;t going to help.</p>
<p>For Apache mod_xsendfile:</p>
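<pre>// assetStoreBase is the path of the mounted asset store; Meta is the application's own metadata type.
protected void doSendFile(String path, Meta meta, HttpServletResponse response) {
  // mod_xsendfile replaces the empty body with the file named in this header.
  response.setHeader("X-Sendfile", assetStoreBase + path);
  response.setHeader("Content-Type", (String) meta.get("content-type"));
  if (meta.has("filename")) {
    // Quote the filename so that names containing spaces survive.
    response.setHeader("Content-Disposition", "attachment; filename=\"" + meta.get("filename") + "\"");
  }
  // That's it, the response can be committed with no body.
}</pre>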
<p>For nginx:</p>
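<pre>// buffering, rateLimit and cacheExpires are assumed to be String settings held by the enclosing class.
protected void doSendFile(String path, Meta meta, HttpServletResponse response) {
    // nginx expects the URI of an internal location here, not a file system path.
    response.setHeader("X-Accel-Redirect", assetStoreBase + path);
    response.setHeader("X-Accel-Buffering", buffering);    // "yes" or "no"
    response.setHeader("X-Accel-Limit-Rate", rateLimit);   // bytes per second, or "off"
    response.setHeader("X-Accel-Expires", cacheExpires);   // seconds, or "off"
    response.setHeader("Content-Type", (String) meta.get("content-type"));
    if (meta.has("filename")) {
        // Quote the filename so that names containing spaces survive.
        response.setHeader("Content-Disposition", "attachment; filename=\"" + meta.get("filename") + "\"");
    }
}</pre>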
<p>For lighttpd:</p>
<pre>protected void doSendFile(String path, Meta meta, HttpServletResponse response) {
    // lighttpd replaces the empty body with the file named in this header.
    response.setHeader("X-LIGHTTPD-send-file", assetStoreBase + path);
    response.setHeader("Content-Type", (String) meta.get("content-type"));
    if (meta.has("filename")) {
        // Quote the filename so that names containing spaces survive.
        response.setHeader("Content-Disposition", "attachment; filename=\"" + meta.get("filename") + "\"");
    }
}</pre>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=833&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2013/01/03/scaling-streaming-from-a-threaded-app-server/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>Making the Digital Repository at Cambridge Fast(er)</title>
		<link>http://blog.tfd.co.uk/2012/12/18/making-the-digital-repository-at-cambridge-faster/</link>
		<comments>http://blog.tfd.co.uk/2012/12/18/making-the-digital-repository-at-cambridge-faster/#comments</comments>
		<pubDate>Tue, 18 Dec 2012 01:45:48 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache HTTP Server]]></category>
		<category><![CDATA[Apache HTTPD]]></category>
		<category><![CDATA[Apache Solr]]></category>
		<category><![CDATA[Digital Library]]></category>
		<category><![CDATA[DSpace]]></category>
		<category><![CDATA[Dspace 3]]></category>
		<category><![CDATA[Open Access]]></category>
		<category><![CDATA[Uniform Resource Locator]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=824</guid>
		<description><![CDATA[For the past month or so I have been working on upgrading the Digital Repository at the University of Cambridge Library from a heavily customised version of DSpace 1.6 to a minimally customised version of DSpace 3. The local customizations were deemed necessary to achieve the performance required to host the 217,000 items and the 4M metadata records in [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=824&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div class="wp-caption alignright" style="width: 130px"><a href="http://www.dspace.cam.ac.uk/handle/1810/233711"><img alt="" src="http://www.dspace.cam.ac.uk/thumbnail/564511/TS-AS-155-155-F.jpg" width="120" height="150" /></a><p class="wp-caption-text">List of commandments from Lev. 9–12; numbered Halakhot 8–18; includes a list of where they appear in Maimonides&#8217; Book of Commandments and the Mishneh Torah, and possibly references to another work.</p></div>
<p>For the past month or so I have been working on upgrading the Digital Repository at the University of Cambridge Library from a heavily customised version of <a class="zem_slink" title="DSpace" href="http://en.wikipedia.org/wiki/DSpace" target="_blank" rel="wikipedia">DSpace</a> 1.6 to a minimally customised version of DSpace 3. The local customizations were deemed necessary to achieve the performance required to host the 217,000 items and the 4M metadata records in the Digital Repository. DSpace 3, which was released at the end of November 2012, showed promise in removing the need for many of the local patches. I am happy to report that this has proved to be the case, and we are now able to cover all of our local use cases using a stock DSpace release with local customizations and optimizations isolated into an overlay. One problem, however, remains: performance.</p>
<p>The current Digital Repository contains detailed metadata and is focused on preservation of artifacts. Unlike the more popular <a href="http://cudl.lib.cam.ac.uk/">Digital Library</a>, which has generated significant <a href="http://www.bbc.co.uk/news/uk-england-cambridgeshire-20711692">media interest</a> in recent weeks with items like &#8220;A 2,000-year-old copy of the 10 Commandments&#8221;, the Digital Repository does not yet have significant traffic. That may change in the next few months, as the UK government is taking a lead in the <a href="http://www.guardian.co.uk/science/2012/jul/25/uk-government-open-access-development-research">Open Access agenda</a>, which may prompt the rest of the world to follow. Cambridge, with its leading position in global research, will be depositing its output into its Digital Repository. Hence, a primary concern of the upgrade process was to ensure that the Digital Repository could handle the expected increase in traffic driven by Open Access.</p>
<h2>Some basics</h2>
<p>DSpace is a Java web application running in Tomcat. Testing Tomcat with a trivial application reveals that it will deliver content at a peak rate of anything up to 6K pages per second. If that rate were sustained for 24h, 518M pages would be served. Unfortunately traffic is never evenly distributed and applications always add overhead, but this gives an idea of the basics. At 1K pages/s, 86M pages would be served in 24h. Many real Java webapps are capable of jogging along happily at that rate. Unfortunately DSpace is not. It&#8217;s an old code base that has focused on the preservation use case. Many page requests perform heavy database access and the flexible Cocoon based XMLUI is resource intensive. The modified 1.6 instance using a JSP UI delivers pages at 8/s on a moderate 8 core box, and the unmodified DSpace 3 instance using the XMLUI at 15/s on a moderate 4 core box. Surprisingly, because the application does not have any web 2.0 functionality to speak of, even at that low level it feels quite nippy, as each page is a single request once the page assets (css/js/png etc) are distributed and cached. With the Cambridge Digital Library regularly serving 1M pages per hour, Open Access on the Digital Repository at Cambridge will change that. When overloaded, DSpace remains solid and reliable, but slow.</p>
<h2>Apache Httpd mod_cache to the rescue</h2>
<p>Fortunately this application is a publishing platform. For anonymous users the data changes very slowly, and the number of users that log into the application is low. The DSpace URLs are well structured and predictable, with no major abuse of the HTTP standard. Even the search operations backed by Solr are well structured. The current data set of 217K items published as html pages represents about 3.9GB of uncompressed data, less if the responses are stored and delivered gzipped. Consequently, configuring Apache HTTPD with mod_cache to cache page responses for anonymous users has a dramatic impact on throughput. A trivial test with Apache Benchmark over 100 concurrent connections indicates a peak throughput of around 19K pages per second. I will leave you to do the rest of the maths. I think network will be the limiting factor.</p>
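<p>For anyone who would rather not do the maths by hand, a back-of-envelope sketch follows. It only uses figures quoted above (19K pages/s, 3.9GB, 217K items); the class and method names are made up for the example, 1GB is taken as 10^9 bytes, and pages are assumed to be of equal size and served uncompressed.</p>

```java
// Back-of-envelope figures only; the inputs are the numbers quoted in the post.
final class CacheThroughput {
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    // Pages served in 24h at a sustained rate.
    static long pagesPerDay(long pagesPerSecond) {
        return pagesPerSecond * SECONDS_PER_DAY;
    }

    // Mean page size in bytes if totalBytes is spread evenly over all items.
    static double meanPageBytes(double totalBytes, long items) {
        return totalBytes / items;
    }

    // Network bandwidth in Gbit/s needed to sustain the rate.
    static double gigabitsPerSecond(long pagesPerSecond, double pageBytes) {
        return pagesPerSecond * pageBytes * 8 / 1e9;
    }

    public static void main(String[] args) {
        long rate = 19_000;                                 // cached pages/s from the benchmark
        double pageBytes = meanPageBytes(3.9e9, 217_000);   // assumes 1GB = 10^9 bytes
        System.out.printf("pages/day: %d%n", pagesPerDay(rate));
        System.out.printf("mean page: %.0f bytes%n", pageBytes);
        System.out.printf("bandwidth: %.2f Gbit/s%n", gigabitsPerSecond(rate, pageBytes));
    }
}
```

<p>Under those assumptions the cached rate works out at roughly 1.6 billion pages a day and somewhere around 2.7 Gbit/s of uncompressed traffic, which is consistent with the network being the limit rather than the server.</p>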
<h2>Losing statistics</h2>
<p>There are some disadvantages to this approach. Deep within DSpace, statistics are recorded. Since the cache will serve most of the content for anon users, these statistics no longer make sense. I have misgivings about the way in which the statistics are being collected, since if the request is serviced by Cocoon, the access is recorded in a Solr Core by performing an update operation on the core. This is one of the reasons why the throughput is slow, but I also have my doubts that this is a good way of recording statistics. Lucene indexes are often bounded by the cardinality of the index. I worry that over time the Lucene indexes behind the Solr instance recording statistics will overflow available memory. I would have thought, but have no evidence, that recording stats in a <a class="zem_slink" title="Big data" href="http://en.wikipedia.org/wiki/Big_data" target="_blank" rel="wikipedia">Big Data</a> way would be more scalable, and in some ways just as easy for small institutions (i.e. append-only log files, periodically processed with Map Reduce if required). Alternatively, Google Analytics.</p>
<h2>Gotchas</h2>
<p>Before you rush off and mod_cache all your slow applications, there is one fly in the ointment. To get this to work you have to separate anonymous responses from authenticated responses. You also have to perform that separation based on the request and nothing else, and you have to ensure that your cache never gets polluted, otherwise anonymous users, including a Google spider, will see authenticated responses. There is precious little in an http request that a server can influence. It can set cookies and change the URL. Applications could segment URL space based on the role of the user, but that is ugly from a URI point of view: suddenly there are 2 URIs pointing to the same resource. Setting a cookie doesn&#8217;t work, since the response that would have set the cookie is cached, hopefully without the cookie. The solution that worked for us was to segment authenticated requests onto https and leave anon requests on http, then configure the URL space used to perform authentication such that it would not be cached, and ensure an anon user never accesses https content and an authenticated user never accesses http content. The latter restriction ensures no authenticated content ever gets cached, and the former ensures that the expected tsunami of anon users doesn&#8217;t impact the management of the repository. Much though I would have liked to serve everything over a single protocol on one virtual host, the approach is generally applicable to all webapps.</p>
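<p>The http/https split boils down to a small routing rule, sketched below. This is an illustration rather than the configuration we deployed: the class and method names are invented, and it assumes the front end has some request-visible way of knowing whether the caller is authenticated.</p>

```java
// Hypothetical routing rule: anonymous traffic stays on http (cacheable),
// authenticated traffic stays on https (never cached).
final class SchemeSeparation {
    // Returns the URL to redirect to, or null when the request is already on the right scheme.
    static String redirectTarget(String scheme, boolean authenticated, String host, String pathAndQuery) {
        String wanted = authenticated ? "https" : "http";
        if (wanted.equalsIgnoreCase(scheme)) {
            return null;
        }
        return wanted + "://" + host + pathAndQuery;
    }

    // Only anonymous http responses are eligible for the shared cache.
    static boolean cacheable(String scheme, boolean authenticated) {
        return !authenticated && "http".equalsIgnoreCase(scheme);
    }

    public static void main(String[] args) {
        // repo.example.org is a placeholder host name.
        System.out.println(redirectTarget("http", true, "repo.example.org", "/handle/1/2"));
        System.out.println(redirectTarget("https", false, "repo.example.org", "/handle/1/2"));
        System.out.println(cacheable("http", false));
    }
}
```

<p>Keeping the cacheability test separate from the redirect rule means that a request which slips through on the wrong scheme is still never written to the cache.</p>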
<p>I think the key message is, if you can host using Apache Httpd with mod_mem_cache or even the disk version, then there is no need to jump through hoops to use exotic application stacks. My testing of DSpace 3 was done with Apache HTTPD 2.2 and all the other components running on a single 4 core box probably well past its sell by date.</p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=824&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/12/18/making-the-digital-repository-at-cambridge-faster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://www.dspace.cam.ac.uk/thumbnail/564511/TS-AS-155-155-F.jpg" medium="image" />
	</item>
		<item>
		<title>AIS NMEA and Google Maps API</title>
		<link>http://blog.tfd.co.uk/2012/11/13/ais-nmea-and-google-maps-api/</link>
		<comments>http://blog.tfd.co.uk/2012/11/13/ais-nmea-and-google-maps-api/#comments</comments>
		<pubDate>Tue, 13 Nov 2012 07:41:32 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[AIS]]></category>
		<category><![CDATA[English Channel]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Google Map]]></category>
		<category><![CDATA[GPS navigation device]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=817</guid>
		<description><![CDATA[Those who know me will know I like nothing better than to get well offshore, away from any hope of network connectivity. It&#8217;s like stepping back 20 years to before the internet and it&#8217;s blissfully quiet. The only problem is that 20 years ago it was too quiet. Crossing the English Channel in thick fog with no [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=817&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Those who know me will know I like nothing better than to get well offshore, away from any hope of network connectivity. It&#8217;s like stepping back 20 years to before the internet and it&#8217;s blissfully quiet. The only problem is that 20 years ago it was too quiet. Crossing the English Channel in thick fog with no radar and a Decca unit that could only be relied on to give you a fix to within a mile some of the time made you glad to be alive when you stood on solid ground again. Rocks and strong currents round the <a href="http://en.wikipedia.org/wiki/Alderney">Alderney</a> race were not nearly as frightening as the St Malo ferry looming out of the fog, horns blaring, as if there was anything a 12m yacht could do in reply. After a few white knuckle trips I bought a 16nm radar, which turned the unseen steel monster of the Channel into a passage like a tortoise crossing a freeway reading an iPod. I don&#8217;t know which was better: trying to guess if the container ship making 25kn, 10nm away, was going to pass in front or behind you, or placing your trust in the unseen ship&#8217;s crew who had spent the past 4 days rolling north through Biscay with no sleep.</p>
<p>Those going to sea today will not experience any of this excitement. They will probably have at least 3 active GPS receivers on board, which will be able to tell them whether they are at the bow, at the stern, or sitting on or standing in the heads (WC for landlubbers). The second bit of kit that they will probably have is an <a href="http://en.wikipedia.org/wiki/Automatic_Identification_System">AIS</a> receiver. All ships now carry AIS transmitters, as do some yachts. Vanity domains for boat owners. AIS transmits on 2 marine <a class="zem_slink" title="Very high frequency" href="http://en.wikipedia.org/wiki/Very_high_frequency" target="_blank" rel="wikipedia">VHF</a> channels, 161.975 MHz and 162.025 MHz, using variants of TDMA sharing. Receivers hand the data on as NMEA0183 sentences, the same standard as used in many older marine systems. In the case of AIS, each sentence carries an 8-bit text payload in which every character encodes 6 bits of data (168 bits for a standard position report), followed by a checksum. The information that&#8217;s broadcast is mostly position, speed, course and identification of the sender, which, although it&#8217;s intended mainly as an instant communication of intentions between large ships, is also invaluable to any smaller craft in fog. It&#8217;s like being on the bridge of all ships in VHF range at the same time, and it&#8217;s relatively simple to calculate the closest point of approach (CPA) for all targets. Before affordable radar, we used to guess CPA using our ears, and sometimes smell (you can smell a supertanker in a strong wind). With radar we used to try and guess the path of an approaching target from the screen. Easy on a stable platform, not so easy when your radome is doing the samba. Today we have speed and real course, often to three significant figures.</p>
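<p>As a rough illustration of how simple the CPA sum is once you have position, speed and course for both vessels, here is a minimal Python sketch. It assumes a flat local grid in nautical miles and constant course and speed for both boats; the function names <code>cpa</code> and <code>velocity</code> are made up for this example and are not taken from any AIS library.</p>

```python
import math

def cpa(own_pos, own_vel, tgt_pos, tgt_vel):
    """Return (time_to_cpa_hours, distance_at_cpa_nm) on a flat local grid.

    Positions are (east, north) in nautical miles, velocities are in knots.
    """
    # Relative position and velocity of the target as seen from our own boat.
    rx, ry = tgt_pos[0] - own_pos[0], tgt_pos[1] - own_pos[1]
    vx, vy = tgt_vel[0] - own_vel[0], tgt_vel[1] - own_vel[1]
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        # Same course and speed: the range never changes.
        return 0.0, math.hypot(rx, ry)
    # Time at which the relative range is smallest; clamp to "now" if it has passed.
    t = max(0.0, -(rx * vx + ry * vy) / speed_sq)
    return t, math.hypot(rx + vx * t, ry + vy * t)

def velocity(speed_kn, course_deg):
    """Convert speed and course (degrees true) into east/north components in knots."""
    c = math.radians(course_deg)
    return speed_kn * math.sin(c), speed_kn * math.cos(c)

# Yacht heading north at 6kn; ship 10nm due east of us heading west at 25kn.
hours, miles = cpa((0.0, 0.0), velocity(6, 0), (10.0, 0.0), velocity(25, 270))
print("CPA in %.0f minutes at %.1f nm" % (hours * 60, miles))
```

<p>A real plotter would work from latitude and longitude and keep recalculating as new reports arrive, but the geometry is no more than this.</p>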
<p>I now live about 20km from the sea, north of Sydney harbor. VHF is line of sight, so I would have thought it was not going to be possible to receive VHF from that distance, taking into account buildings and trees. Normal marine whip aerials probably would not work, but a strip of 300 Ohm ribbon cable cut precisely to length and tuned to resonate at 1/4 wavelength at 162 MHz is receiving and decoding signals from Newcastle to the north and Wollongong to the south, around 80km in each direction. Not bad for $5 worth of cable. The receiver is a <a href="https://www.whitworths.com.au/main_itemdetail.asp?item=45797&amp;search123=AIS&amp;intAbsolutePage=1">cheap headless unit from ACR</a> that sends the NMEA0183 sentences down a USB/serial port. A simple Python script receives and decodes the NMEA0183 stream, converting it (using ais.py from the GPSD project) into JSON containing current position, speed, <a class="zem_slink" title="Maritime Mobile Service Identity" href="http://en.wikipedia.org/wiki/Maritime_Mobile_Service_Identity" target="_blank" rel="wikipedia">MMSI number</a> and a host of other information. All very interesting, but not very visual. I could just use one of the many free apps to display the NMEA0183 information over TCP/UDP, but they are limited.</p>
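<p>To give a feel for what the decoding step involves, here is a small Python sketch that checks the NMEA checksum and unpacks the 6-bit payload far enough to read the message type and MMSI. It is not the ais.py from GPSD, just an illustration; the function names are invented for this example and the sentence is a widely circulated sample rather than one captured by my receiver. It only handles single-fragment sentences.</p>

```python
from functools import reduce

def nmea_checksum_ok(sentence):
    """Compare the two hex digits after '*' with the XOR of the sentence body."""
    body, _, given = sentence.strip().lstrip("!$").partition("*")
    computed = reduce(lambda acc, ch: acc ^ ord(ch), body, 0)
    return given[:2].upper() == format(computed, "02X")

def payload_bits(payload):
    """Turn the 8-bit text payload into a string of 0s and 1s, 6 bits per character."""
    bits = []
    for ch in payload:
        value = ord(ch) - 48
        if value > 40:
            value -= 8
        bits.append(format(value, "06b"))
    return "".join(bits)

def decode_header(sentence):
    """Return message type, MMSI and payload length for a single-fragment sentence."""
    if not nmea_checksum_ok(sentence):
        raise ValueError("bad NMEA checksum")
    fields = sentence.split(",")
    bits = payload_bits(fields[5])
    # Bits 0-5 are the message type, bits 8-37 the 30-bit MMSI.
    return {"type": int(bits[0:6], 2), "mmsi": int(bits[8:38], 2), "bits": len(bits)}

# Sample sentence for illustration only.
sample = "!AIVDM,1,1,,B,177KQJ5000G?tO`K>RA1wUbN0TKH,0*5C"
print(decode_header(sample))
# prints {'type': 1, 'mmsi': 477553000, 'bits': 168}
```

<p>Position, speed and course sit further along the same bit string, which is where a proper decoder earns its keep.</p>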
<p><a href="http://ianboston.files.wordpress.com/2012/11/screen-shot-2012-11-13-at-17-35-55.png"><img class="alignleft size-medium wp-image-818" title="Screen shot 2012-11-13 at 17.35.55" alt="" src="http://ianboston.files.wordpress.com/2012/11/screen-shot-2012-11-13-at-17-35-55.png?w=300&#038;h=253" height="253" width="300" /></a>Google Maps v3 API allows Javascript to create an overlay of markers. So a few hundred lines of Javascript load the JSON file into a browser every 15s and display the results on an overlay on Google Maps. Ships are red with a vector, the wake is green. Sydney is a good place to test this as nearly all the ferries in the harbor transmit AIS messages all the time. It&#8217;s a busy place. Obviously using Google Maps 100nm offshore isn&#8217;t going to work. The next step is to load the Python onto a <a href="http://www.raspberrypi.org/">Raspberry Pi</a> board, plug in a USB Wifi dongle and create my own mobile wifi hotspot which an iPad loaded with marine charts can connect to, all for significantly less than 1 Amp. If you&#8217;re on 12V you care about juice. Having IP offshore does defeat the purpose of being there; I may have to turn it off from time to time to remind myself I am alive.</p>
<p>This interface was just an exercise to validate that the NMEA to TCP over Wifi server works. If you want to know when your ship will come in, visit <a href="http://www.marinetraffic.com/ais/">http://www.marinetraffic.com/ais/</a>, but don&#8217;t try and use it at sea.</p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=817&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/11/13/ais-nmea-and-google-maps-api/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://ianboston.files.wordpress.com/2012/11/screen-shot-2012-11-13-at-17-35-55.png?w=300" medium="image">
			<media:title type="html">Screen shot 2012-11-13 at 17.35.55</media:title>
		</media:content>
	</item>
		<item>
		<title>HowTo: Quickly resolve what an Sling/OSGi bundle needs.</title>
		<link>http://blog.tfd.co.uk/2012/10/30/howto-quickly-resolve-what-an-slingosgi-bundle-needs/</link>
		<comments>http://blog.tfd.co.uk/2012/10/30/howto-quickly-resolve-what-an-slingosgi-bundle-needs/#comments</comments>
		<pubDate>Tue, 30 Oct 2012 06:52:54 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Maven]]></category>
		<category><![CDATA[BND]]></category>
		<category><![CDATA[Build Management]]></category>
		<category><![CDATA[Java Classloader]]></category>
		<category><![CDATA[osgi]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=813</guid>
		<description><![CDATA[Resolving dependencies for an OSGi bundle can be hard at times, especially if working with legacy code. The sure-fire way of finding all the dependencies is to spin the bundle up in an OSGi container, but that requires building the bundle and deploying it. Here is a quick way of doing it with maven, that [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=813&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Resolving dependencies for an <a class="zem_slink" title="OSGi" href="http://en.wikipedia.org/wiki/OSGi" target="_blank" rel="wikipedia">OSGi</a> bundle can be hard at times, especially if working with legacy code. The sure-fire way of finding all the dependencies is to spin the bundle up in an OSGi container, but that requires building the bundle and deploying it. Here is a quick way of doing it with Maven that may at first sound odd.</p>
<p>If you&#8217;re building your bundle with Maven, you will be using the BND tool via the maven-bundle-plugin. This analyses all the byte code that is going into the bundle to work out what will cross over the class-loader boundary. BND via the maven-bundle-plugin has a default import rule of &#8216;*&#8217;, i.e. import everything. If you are trying to control which dependencies are embedded, which are ignored and which are imported, this can be a hindrance. Strange though it sounds, if you remove it life will be easier. BND will immediately report everything that it needs to import that can&#8217;t be imported. It will refuse to build, which is a lot faster than generating a build that won&#8217;t deploy. The way BND reports is also useful. It tells you exactly what it can&#8217;t find, and this gives you a list of packages to import, ignore or embed. Once you think you have your list of package imports down to a set that you expect to come from other bundles in your container, turn the &#8216;*&#8217; import back on and away you go.</p>
<p>In maven that means editing the pom.xml eg:</p>
<pre>...
 &lt;plugin&gt;
   &lt;groupId&gt;org.apache.felix&lt;/groupId&gt;
   &lt;artifactId&gt;maven-bundle-plugin&lt;/artifactId&gt;
   &lt;version&gt;2.3.6&lt;/version&gt;
   &lt;extensions&gt;true&lt;/extensions&gt;
   &lt;configuration&gt;
     &lt;instructions&gt;
       &lt;Import-Package&gt;
         &lt;!-- add ignore packages before the * as required, eg. !org.testng.annotations, --&gt;
         * &lt;!-- comment the * out to cause BND to report everything it has not been told to import --&gt;
       &lt;/Import-Package&gt;
       &lt;Private-Package&gt;
         &lt;!-- add packages that you want to appear as raw classes in the jar as private packages. Note, they don't have to be source code in the project, they can be anywhere on the classpath for the project, but be careful about resources, eg. org.apache.sling.commons.cache.infinispan.* --&gt;
       &lt;/Private-Package&gt;
       &lt;DynamicImport-Package&gt;sun.misc.*&lt;/DynamicImport-Package&gt;
       &lt;Embed-Transitive&gt;true&lt;/Embed-Transitive&gt;
       &lt;Embed-Dependency&gt;
           &lt;!-- embed dependencies (by artifact ID, including transitives if Embed-Transitive is true) that you don't want exposed to OSGi --&gt;
       &lt;/Embed-Dependency&gt;
     &lt;/instructions&gt;
   &lt;/configuration&gt;
 &lt;/plugin&gt;</pre>
<p>The OSGi purists will tell us that it&#8217;s heresy to embed anything but sometimes with legacy systems it&#8217;s just too painful to deal with the classloader issues.</p>
<p>There is probably a better way of doing this, if so, do tell.</p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=813&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/10/30/howto-quickly-resolve-what-an-slingosgi-bundle-needs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>Sakai CLE ElasticSearch</title>
		<link>http://blog.tfd.co.uk/2012/10/11/sakai-cle-elasticsearch/</link>
		<comments>http://blog.tfd.co.uk/2012/10/11/sakai-cle-elasticsearch/#comments</comments>
		<pubDate>Thu, 11 Oct 2012 06:50:56 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Solr]]></category>
		<category><![CDATA[ElasticSearch]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Tony Hoare]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=748</guid>
		<description><![CDATA[A long time ago, I wrote a search module for Sakai 2 as CLE was known then. It attempted to make every node in a CLE instance share the load of indexing and searching and make the search aspect of a CLE cluster scale elastically. To some extent it worked, but it had problems. The indexing [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=748&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://ianboston.files.wordpress.com/2012/10/screen-shot-2012-10-11-at-17-48-24.png"><img class="alignright  wp-image-809" title="Screen shot 2012-10-11 at 17.48.24" alt="" src="http://ianboston.files.wordpress.com/2012/10/screen-shot-2012-10-11-at-17-48-24.png?w=222&#038;h=240" height="240" width="222" /></a>A long time ago, I wrote a search module for Sakai 2, as CLE was known then. It attempted to make every node in a CLE instance share the load of indexing and searching and make the search aspect of a CLE cluster scale elastically. To some extent it worked, but it had problems. The indexing queue was persisted in a DB table and it was based on an old version of Lucene that didn&#8217;t have anything as useful as commits. Consequently it could get its segments into a bit of a mess at times. The world has moved on in the 5 years since I wrote that code, and two viable alternatives for supporting search in Sakai CLE have emerged: <a class="zem_slink" title="Apache Solr" href="http://lucene.apache.org/solr/" target="_blank" rel="homepage">Apache Solr</a> and ElasticSearch. Both can be run as remote servers or embedded. Both are solid, reliable releases. It could be argued that Solr has more support for sophisticated index schemas, and it&#8217;s probably true that ElasticSearch is easier to deploy for elastic scaling and real-time indexing, as that&#8217;s its default behaviour.</p>
<p>For those wanting to try Sakai CLE with Apache Solr as the search server then look no further than the work that Adam Marshall has been doing at Oxford University. That allows you to spin up a Solr instance and connect your Sakai CLE instances to it. You will have to do some reasonably sophisticated master slave configuration to make it resilient to failures and don&#8217;t expect the indexing operations to be real-time. There are plenty of references to the work required to do that in this blog, and arguments why I currently prefer ElasticSearch over Solr.</p>
<h2>Deployment and reliability</h2>
<p><a class="zem_slink" title="ElasticSearch" href="http://www.elasticsearch.org" target="_blank" rel="homepage">ElasticSearch</a> comes out of the box being real-time, elastic and cloud aware, with built-in AWS EC2 knowledge as well as rack awareness. It&#8217;s been built to shard, partition and replicate indexes out of the box. The ElasticSearch client, as I am finding out, is simple to embed into most environments including <a class="zem_slink" title="OSGi" href="http://www.osgi.org" target="_blank" rel="homepage">OSGi</a>, and when embedded makes each app server node a part of the ElasticSearch cluster. Best of all, for the nervous by nature, is the resilience that comes from spinning up more than 3 instances in the same cluster. In fact, I have been finding it hard to damage ElasticSearch indexes in tests. It&#8217;s perfectly possible to do all of this with Solr, but the deployer has to work a little harder, adding some custom components to support a write-ahead log and a Zookeeper instance to manage the cloud.</p>
<h2>Metadata Indexing</h2>
<p>Probably the best part of ElasticSearch is the client, which is fully multithreaded and follows the pattern of <a href="http://www.usingcsp.com/">Communicating Sequential Processes</a> first described by <a href="http://research.microsoft.com/~thoare/">Tony Hoare</a>, one of the motivators for the <a class="zem_slink" title="Go (programming language)" href="http://golang.org" target="_blank" rel="homepage">Go language</a>. This allows a client to submit suitably lightweight indexing requests to the ElasticSearch cluster via an embedded client without needing to think about managing a queue or the latency of indexing. This nice little feature turns the 1000 lines of code I had to write for Sakai CLE and OAE search into about 20. Initial tests show that indexing can be done within the request loop and, because of the true real-time nature of ElasticSearch with its write-ahead log, results are available about 50ms after the transaction commits. To maintain that latency, I only index metadata via this route. Document indexing takes a different route.</p>
<h2>Document Indexing</h2>
<p>I found with the original Sakai 2 search and the subsequent Solr-based indexing of documents in Sakai OAE that indexing bodies was expensive. In some instances tokenizing office documents could place extreme strain on a JVM heap. For that reason, when I did the indexing service in the Django version of OAE, I did two things. I offloaded the document body indexing operations to separate processors driven by a queue of events, following the CSP pattern mentioned above, and I made the content store single instance. Where users collaborate, they often upload the same document. With a single instance content store, only a single instance of a document is stored and hence tokenizing and information extraction is only performed once. This greatly reduces the cost of indexing. The store isn&#8217;t collision perfect, but by performing a hash on the document body as it&#8217;s saved it&#8217;s possible to eliminate most if not all collisions. Certainly SHA1-hashing enough of the body eliminates collisions for all practical purposes.</p>
<p>So the document indexing processes use the index to locate documents that need to be indexed and then use the single instance content store to eliminate duplicate tokenizing. Using this approach in the Sparse Content Map content system which is already single instance has a dramatic impact on IO. Sakai CLE Content Hosting Service is not single instance at present but could be adjusted to be so once hashes are known. It would be nice to fix that aspect of CHS at some point.</p>
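<p>The single instance idea can be sketched in a few lines of Python. This is a toy, not the OAE or Sparse Content Map code: the class name <code>SingleInstanceStore</code> and its methods are made up for illustration, everything lives in memory, and it hashes the whole body with SHA1 rather than just enough of it.</p>

```python
import hashlib

class SingleInstanceStore:
    """Toy content store that keeps one copy of each distinct document body."""

    def __init__(self, tokenizer):
        self._bodies = {}       # content hash -> body bytes
        self._tokens = {}       # content hash -> tokens, extracted once
        self._paths = {}        # user-visible path -> content hash
        self._tokenizer = tokenizer
        self.tokenize_calls = 0

    def save(self, path, body):
        digest = hashlib.sha1(body).hexdigest()
        if digest not in self._bodies:
            # First time this body is seen: store it and pay the indexing cost.
            self._bodies[digest] = body
            self._tokens[digest] = self._tokenizer(body)
            self.tokenize_calls += 1
        self._paths[path] = digest
        return digest

    def tokens(self, path):
        """Tokens for a path, shared by every path with the same body."""
        return self._tokens[self._paths[path]]

store = SingleInstanceStore(lambda body: body.decode("utf-8").lower().split())
store.save("/alice/notes.txt", b"Shared lecture notes")
store.save("/bob/copy-of-notes.txt", b"Shared lecture notes")
store.save("/carol/essay.txt", b"A different document")
print(store.tokenize_calls)  # prints 2
```

<p>Three uploads, two tokenizing passes. In a real system the expensive step would sit behind the event queue rather than inside <code>save</code>.</p>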
<h2>Current state</h2>
<p>I am still working on this code, and this post is part notes, part notification should I get distracted. My testbed is the Sparse Content Map content system only because it builds in 20s, starts in 5, has full integration test coverage and compliant webdav support thanks to Milton. There is currently nothing in the code base that prevents it using Spring or a Webapp container as opposed to OSGi, and the coupling is loose being event driven. The best part is the result should scale as far as ES can scale which is probably a lot larger than any CLE instance in production.</p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=748&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/10/11/sakai-cle-elasticsearch/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://ianboston.files.wordpress.com/2012/10/screen-shot-2012-10-11-at-17-48-24.png?w=278" medium="image">
			<media:title type="html">Screen shot 2012-10-11 at 17.48.24</media:title>
		</media:content>
	</item>
		<item>
		<title>Fibonacci ring for Cassandra</title>
		<link>http://blog.tfd.co.uk/2012/10/10/fibonacci-ring-for-cassandra/</link>
		<comments>http://blog.tfd.co.uk/2012/10/10/fibonacci-ring-for-cassandra/#comments</comments>
		<pubDate>Wed, 10 Oct 2012 07:58:01 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Cassandra]]></category>
		<category><![CDATA[Cassandra]]></category>
		<category><![CDATA[Distributed computing]]></category>
		<category><![CDATA[Fibonacci]]></category>
		<category><![CDATA[Fibonacci number]]></category>
		<category><![CDATA[Linear Congruential Generator]]></category>
		<category><![CDATA[Vi Hart]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=730</guid>
		<description><![CDATA[No this isn&#8217;t a Greek tragedy or some software that I have written, but a thought about the way in which Apache Cassandra and other distributed systems perform problem space decomposition. Cassandra is a good example of a distributed system with problem space decomposition. Its problem space is keys. To be efficient it needs to [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=730&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div class="wp-caption alignright" style="width: 310px"><a href="http://commons.wikipedia.org/wiki/File:Protea_flower.jpg" target="_blank"><img class="zemanta-img-inserted zemanta-img-configured" title="King Protea (Protea cynaroides)" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Protea_flower.jpg/300px-Protea_flower.jpg" alt="King Protea (Protea cynaroides)" width="300" height="200" /></a><p class="wp-caption-text">King Protea (Protea cynaroides) (Photo credit: Wikipedia)</p></div>
<p>No, this isn&#8217;t a Greek tragedy or some software that I have written, but a thought about the way in which Apache Cassandra and other distributed systems perform problem space decomposition. Cassandra is a good example of a distributed system with problem space decomposition. Its problem space is keys. To be efficient it needs to distribute those keys evenly around its cluster. The key partitioning algorithm normally uses something that generates a flat, even distribution. A <a href="http://en.wikipedia.org/wiki/Linear_congruential_generator">Linear Congruential Generator</a> could be used if you are prepared to live with some banding in the problem space. If not, and you are prepared to live with a bit more computational expense, one of the hash functions like MD5 or SHAx could be used. In fact the standard key distribution functions in Cassandra use something based on MD5, which to my naive mind must have some collisions.</p>
<p>In reading the Cassandra documentation and using it some years back I became concerned about how elastic Cassandra is. The decomposition of Cassandra&#8217;s key domain is often represented as a ring. That ring is constructed when the cluster is created, and elements are allocated via the key-&gt;ring function; I think they are called partitioners. From reading the documentation, partitioning of this space is fixed and static. If more nodes need to be added to a Cassandra cluster then the partitioning scheme must be updated and data must be migrated from existing nodes in the cluster to their new home before the cluster can become fully active again. I think I got that right. That means, although you can replace nodes, you can&#8217;t elastically scale without partitioning work. I am not absolutely clear whether the re-partitioning work can be done on a live system or not. I would hope it can.</p>
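<p>To make the ring concrete, here is a toy model in Python. It is not Cassandra&#8217;s partitioner code: the names <code>token_for</code>, <code>owner</code> and <code>evenly_spaced</code> are invented for this sketch, and the ring size of 2^127 is an assumption chosen to resemble an MD5-based token range. It places four nodes evenly, drops a fifth into one arc and counts how many keys change owner.</p>

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed token range for this toy model

def token_for(key):
    """Map a key to a ring position using MD5 (illustrative, not Cassandra's code)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % RING_SIZE

def owner(node_tokens, key):
    """Index of the node whose token is the first at or after the key's token."""
    i = bisect.bisect_left(node_tokens, token_for(key))
    return i % len(node_tokens)  # wrap around past the last token

def evenly_spaced(n):
    """Tokens for n nodes spread evenly around the ring."""
    return [i * RING_SIZE // n for i in range(n)]

keys = ["key-%d" % i for i in range(10000)]
before = evenly_spaced(4)
after = sorted(before + [RING_SIZE // 8])  # one extra node splits a single arc
moved = sum(1 for k in keys if before[owner(before, k)] != after[owner(after, k)])
print("keys that change owner:", moved, "of", len(keys))
```

<p>Only the keys in the arc that was split change owner, roughly an eighth of them here, but the ring is now lopsided: two nodes hold an eighth each while the others hold a quarter. Getting back to an even spread means moving tokens, and with them data.</p>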
<p>That got me thinking. There are other systems that repartition effectively during operation. Algebraic Multigrids used to solve high Reynolds number Eulerian grids repartition to accelerate the solution phase. I wrote a parallel AMG solver to run on Cray T3Ds in 1995. It was fast and efficient with good convergence rates, but struggled to beat the Cray vectorised versions of the code base on reasonably sized clusters. There is another. A plant. A plant doesn&#8217;t shut down when it adds petals to its flower or leaves to its stem, it keeps running (so to speak, I haven&#8217;t seen a running flower since University). The domain space that the plant is partitioning is sunlight. It doesn&#8217;t add leaves as a whole ring, but adds them one by one to make the most use of the available sunlight without shading other spaces. It doesn&#8217;t require that the cells from one leaf or petal migrate to the new leaf. In essence a plant has achieved the trick of scaling elastically.</p>
<h2>How does it do this?</h2>
<p>There is a biological explanation associated with levels of hormones in the stem which are triggered by light levels, which could be considered to be as adaptive as the AMG solver is, driven by its solution. Stepping back a bit, there is an observation often used in math classes. The number of spirals in many plants is observed to be adjacent numbers in the <a class="zem_slink" title="Fibonacci number" href="http://en.wikipedia.org/wiki/Fibonacci_number" rel="wikipedia" target="_blank">Fibonacci sequence</a>, often 8, 13 and 21 but sometimes as high as 144 spirals. There is a <a href="http://www.khanacademy.org/math/vi-hart/v/doodling-in-math--spirals--fibonacci--and-being-a-plant--1-of-3">delightful explanation</a> of <a class="zem_slink" title="Conifer cone" href="http://en.wikipedia.org/wiki/Conifer_cone" rel="wikipedia" target="_blank">Pinecones</a>, Pineapples, <a class="zem_slink" title="Protea" href="http://en.wikipedia.org/wiki/Protea" rel="wikipedia" target="_blank">Protea</a> and the Fibonacci sequence by <a class="zem_slink" title="Vi Hart" href="http://www.youtube.com/Vihart" rel="youtube" target="_blank">Vi Hart</a>. Even if you think you have learnt everything, it&#8217;s fun to watch.</p>
<h2>How is this relevant?</h2>
<p>I wonder what would happen if a Cassandra ring were seeded with an initial space that allowed, say, 5 partitions, and as those partitions passed a threshold of, say, 30% (with an even distribution) another partition was added. That new partition would attract new keys without requiring migration of the existing keys, ensuring that the original partitions never filled. If successful, as new nodes were added in the same way as segments are added to a pineapple, the Cassandra cluster could scale elastically, or more elastically than it appears to do currently. That really is just a thought, and I haven&#8217;t written a partitioner yet to see if it would work. I think the partitioner would be based on the ratio of adjacent numbers in the Fibonacci sequence, i.e. the <a href="http://en.wikipedia.org/wiki/Golden_angle">Golden Angle</a>.</p>
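<p>I still haven&#8217;t written that partitioner, but the spacing property it would lean on is easy to check. The Python sketch below places each new node one golden-angle step further round a unit ring, the way a plant places each new leaf, and reports how uneven the arcs between nodes are. The function names are made up for this example, and it says nothing about whether key migration can actually be avoided.</p>

```python
import math

GOLDEN_FRACTION = (math.sqrt(5) - 1) / 2  # 1/phi, about 0.618 of a full turn

def node_position(n):
    """Position of the n-th node on a unit ring, stepping by the golden fraction."""
    return (n * GOLDEN_FRACTION) % 1.0

def arc_lengths(node_count):
    """Lengths of the arcs between neighbouring nodes, largest first."""
    points = sorted(node_position(n) for n in range(node_count))
    gaps = [b - a for a, b in zip(points, points[1:])]
    gaps.append(1.0 - points[-1] + points[0])  # wrap-around arc
    return sorted(gaps, reverse=True)

for count in (5, 8, 13, 21, 50):
    gaps = arc_lengths(count)
    print(count, "nodes: largest arc / smallest arc =", round(gaps[0] / gaps[-1], 3))
```

<p>However many nodes have been added, the largest arc stays within about 2.6 times the smallest, and nobody already on the ring has to move. Each new node always lands in one of the largest remaining gaps, which is the property that makes the idea tempting.</p>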
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=730&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/10/10/fibonacci-ring-for-cassandra/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Protea_flower.jpg/300px-Protea_flower.jpg" medium="image">
			<media:title type="html">King Protea (Protea cynaroides)</media:title>
		</media:content>
	</item>
		<item>
		<title>Node.js vs SilkJS</title>
		<link>http://blog.tfd.co.uk/2012/09/28/node-js-vs-silkjs/</link>
		<comments>http://blog.tfd.co.uk/2012/09/28/node-js-vs-silkjs/#comments</comments>
		<pubDate>Fri, 28 Sep 2012 09:25:34 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Asynchrony]]></category>
		<category><![CDATA[I/O bound]]></category>
		<category><![CDATA[Node.js]]></category>
		<category><![CDATA[Open source]]></category>
		<category><![CDATA[SilkJS]]></category>
		<category><![CDATA[V8 (JavaScript engine)]]></category>
		<category><![CDATA[WebSocket]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=715</guid>
		<description><![CDATA[Node.js, everyone on the planet has heard about. Every developer at least. SilkJS is relatively new and creates an interesting server to compare Node.js against because it shares so much of the same code base. Both are based on the Google V8 Javascript engine that converts JS into compiled code before executing. Node.js as we all know [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=715&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a class="alignright zemanta-img" href="http://www.flickr.com/photos/78308472@N05/7196742456" target="_blank"><img class="zemanta-img-inserted zemanta-img-configured" title="synchronous ducks" src="http://farm9.static.flickr.com/8013/7196742456_481fb8e06d_m.jpg" alt="synchronous ducks" width="240" height="180" /></a></p>
<p><a class="zem_slink" title="Node.js" href="http://en.wikipedia.org/wiki/Node.js" rel="wikipedia" target="_blank">Node.js</a>, everyone on the planet has heard about. Every developer at least. SilkJS is relatively new and makes an interesting server to compare Node.js against because it shares so much of the same code base. Both are based on the Google <a class="zem_slink" title="V8 (JavaScript engine)" href="http://en.wikipedia.org/wiki/V8_%28JavaScript_engine%29" rel="wikipedia" target="_blank">V8 Javascript engine</a>, which converts JS into compiled code before executing it. Node.js, as we all know, uses a single thread that uses an OS-level event queue to process events. What is often overlooked is that a single thread means a single core of the host machine. SilkJS is a threaded server using pthreads where each thread processes a request, leaving it up to the OS to manage interleaving between threads while waiting for IO to complete. Node.js is often referred to as async and SilkJS as sync. The advantages of each approach are the source of many flame wars. There is a good <a href="http://silkjs.org/sync-vs-async/">summary</a> of the differences and reasons for each approach on the SilkJS website. In essence SilkJS claims to have a less complex programming model that does not require the developer to constantly think of everything in terms of events and callbacks in order to coerce a single thread into doing useful work whilst IO is happening. This approach does hand the interleaving of IO over to the OS, letting it decide when each pthread should be run. OS developers will argue that that&#8217;s what an OS should be doing, and certainly to get the most out of modern multicore hardware there is almost no way of getting away from the need to run multiple processes or threads to use all cores. There is some evidence in the benchmarks (horror, benchmarks, that&#8217;s a red rag to a bull!) from Node.js, SilkJS, Tomcat7, Jetty8, Tornado etc that using multiple threads or processes is a requirement for making use of all cores. So what is that evidence?</p>
<p>Well, first read why not to trust benchmarks: <a href="http://webtide.intalio.com/2010/06/lies-damned-lies-and-benchmarks-2/">http://webtide.intalio.com/2010/06/lies-damned-lies-and-benchmarks-2/</a>. Once you&#8217;ve read that, let&#8217;s assume that everyone creating a benchmark is trying to show their software off at its best.</p>
<p>The Node.js 0.8.0 release gives a benchmark for a 1K response of 3585.62 requests/second. <a href="http://blog.nodejs.org/2012/06/25/node-v0-8-0/">http://blog.nodejs.org/2012/06/25/node-v0-8-0/</a></p>
<p>Over at <a href="http://vertxproject.wordpress.com/2012/05/09/vert-x-vs-node-js-simple-http-benchmarks/">Vert.x</a> there was a benchmark of Vert.x and Node.js showing Vert.x running at 300,00 requests/s. You do have to take it with a pinch of salt after you have read another post, <a href="http://webtide.intalio.com/2012/05/truth-in-benchmarking/">http://webtide.intalio.com/2012/05/truth-in-benchmarking/</a>, with some detailed analysis that points out that testing performance on the same box with no network and no latency is theoretically interesting, but probably not informative for the real world. What is more important is whether the server can stand up reliably forever with no downtime and perform normal server-side processing.</p>
<p>SilkJS, in one of its more reasonable benchmarks, claims to run at around 22,000 requests per second delivering a 13K file from disk at a very high level of concurrency (20000). Again it&#8217;s hard to tell how true the benchmark is since many of those requests are pipelined (no socket open overhead), but one thing is clear. With a server capable of handling that level of concurrency, some of the passionate arguments supporting async servers running one thread per core are lost. Either way works.</p>
<p>There is another side to the SilkJS claims that bears some weight. With 200 server threads, what happens when one dies or needs to do something that is not <a class="zem_slink" title="I/O bound" href="http://en.wikipedia.org/wiki/I/O_bound" rel="wikipedia" target="_blank">IO bound</a>? Something mildly non-trivial that might use a tiny bit of CPU. With 1 server thread we know what happens: the server queues everything up while the one server thread does that computation. With 200, the OS manages the time spent working on the 1 thread. There is a simple answer, offload anything that does any processing to a threaded environment, but then you might as well use an async proxy front end to achieve the same.</p>
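<p>The queueing effect is easy to see in a toy model. The Python sketch below is not a benchmark of Node.js or SilkJS; it just simulates requests being handed in order to whichever worker is free first. It assumes every worker can run on its own core and ignores context switching, and the function name <code>finish_times</code> is made up for this example.</p>

```python
import heapq

def finish_times(durations_ms, workers):
    """Simulate requests that all arrive at t=0 and are served in order.

    Each worker handles one request at a time; returns each request's finish time.
    """
    free_at = [0.0] * workers           # min-heap of times at which workers become free
    heapq.heapify(free_at)
    finished = []
    for duration in durations_ms:
        start = heapq.heappop(free_at)  # earliest available worker takes the request
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished

# One 500 ms CPU-heavy request followed by twenty 1 ms requests.
load = [500.0] + [1.0] * 20
print("1 worker, last request done at", finish_times(load, 1)[-1], "ms")   # 520.0
print("4 workers, last request done at", finish_times(load, 4)[-1], "ms")  # 7.0
```

<p>With one worker every small request waits behind the heavy one. With four, the heavy request ties up one worker and the rest carry on.</p>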
<p>There is another part of the SilkJS argument that holds some weight. What happens when 1 of the SilkJS workers dies? Errors that kill processes happen for all sorts of reasons, some of them nothing to do with the code in the thread. With 199 threads the server continues to respond; with 0 it does not. At this point everyone who is enjoying the single-threaded simplicity of an async server will, I am sure, be telling me their process is so robust it will never die. That may well be true, but processes don&#8217;t always die on their own; sometimes they get killed. The counter argument is: what happens when all 199 threads are busy running something? The threaded server dies.</p>
<p>To be balanced, life in an async server can be wonderfully simple. There is absolutely no risk of thread contention since there is only ever one thread, and it doesn&#8217;t matter how long a request might be pending for IO, as all IO is theoretically non blocking. It doesn&#8217;t matter how many requests there are provided there is enough memory to represent the queue. Synchronous servers can&#8217;t do the long requests required by <a class="zem_slink" title="WebSocket" href="http://en.wikipedia.org/wiki/WebSocket" rel="wikipedia" target="_blank">WebSockets</a> and CometD. Well they can, but the thread pool soon gets exhausted. The ugly truth is that async servers also have something that gets exhausted: memory. Every operation in the event queue consumes valuable memory, and with many garbage collected systems, garbage collection is significant. Although it may not be apparent at light loads, at heavy loads, even if CPU and IO are not saturated, async servers suffer from memory exhaustion and/or garbage collection trying to avoid memory exhaustion, which may appear as CPU exhaustion. So life is not so simple: thread contention is replaced by memory contention, which is arguably harder to address.</p>
<h2>So what is the best server architecture for a modern web application?</h2>
<p>An architecture that uses threads for requests that can be processed and delivered in milliseconds, consuming no memory and delegating responsibility for interleaving IO to the OS, the resident expert at that task. Coupled with an architecture that recognises long IO intensive requests as such and delegates them to the async part of the server, and above all, an architecture on which a simple and straightforward framework can be built to allow developers to get on with the task of delivering applications at webscale, rather than wondering how to achieve webscale with high load reliability. I don&#8217;t have an answer, other than that it could be built with Jetty, but I know one thing: the golden bullets on each side of this particular flame war are only part of the solution.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/09/28/node-js-vs-silkjs/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://farm9.static.flickr.com/8013/7196742456_481fb8e06d_m.jpg" medium="image">
			<media:title type="html">synchronous ducks</media:title>
		</media:content>
	</item>
		<item>
		<title>Google CourseBuilder, a scalable course delivery platform?</title>
		<link>http://blog.tfd.co.uk/2012/09/15/google-coursebuilder-a-scalable-course-delivery-platform/</link>
		<comments>http://blog.tfd.co.uk/2012/09/15/google-coursebuilder-a-scalable-course-delivery-platform/#comments</comments>
		<pubDate>Sat, 15 Sep 2012 01:51:18 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Google App Engine]]></category>
		<category><![CDATA[Google CourseBuilder]]></category>
		<category><![CDATA[Massive open online course]]></category>
		<category><![CDATA[MOOC]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=677</guid>
		<description><![CDATA[This week I discovered Google CourseBuilder, the latest entry into the MOOC arena. It&#8217;s a Google App Engine application that Google Research used to host a MOOC to 155K students a few months ago. It follows a similar pedagogy to that used by other MOOC providers with high quality video lessons, that give the student the [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&#038;blog=6575768&#038;post=677&#038;subd=ianboston&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="https://plus.google.com/u/0/photos/116209820892667217551/albums/5787948840585749233/5787948841936777346"><img class="alignright size-medium wp-image-679" title="GoogleCourseBuilderMulti - Computer.m4v" src="http://ianboston.files.wordpress.com/2012/09/googlecoursebuildermulti-computer-m4v.jpg?w=300&#038;h=187" alt="" width="300" height="187" /></a>This week I discovered <a href="https://code.google.com/p/course-builder/" target="_blank">Google CourseBuilder</a>, the latest entry into the <a class="zem_slink" title="Massive open online course" href="http://en.wikipedia.org/wiki/Massive_open_online_course" rel="wikipedia" target="_blank">MOOC</a> arena. It&#8217;s a <a class="zem_slink" title="Google App Engine" href="http://code.google.com/appengine/" rel="homepage" target="_blank">Google App Engine</a> application that <a class="zem_slink" title="Google" href="http://google.com" rel="homepage" target="_blank">Google Research</a> used to host a MOOC to 155K students a few months ago. It follows a similar pedagogy to that used by other MOOC providers, with high quality video lessons that give the student the feeling they are working one on one with the lecturer. Google have open sourced the code under an <a class="zem_slink" title="Apache License" href="http://en.wikipedia.org/wiki/Apache_License" rel="wikipedia" target="_blank">Apache 2 license</a>, which gives us all an insight into the economies of scale that a MOOC represents. Unlike the traditional <a class="zem_slink" title="Virtual learning environment" href="http://en.wikipedia.org/wiki/Virtual_learning_environment" rel="wikipedia" target="_blank">Virtual Learning Environment</a>, where the needs of staff are catered for in the user interface, Google CourseBuilder currently delegates all the functionality to spreadsheets and to editing snippets of JavaScript and HTML. 
There is no reason why it could not be given a user interface, but when you consider what it is trying to do you realise that staff user interfaces for course creation are less important than the delivery of the course at scale. Consequently the application itself is tightly focused on delivering the course as quickly and as simply as possible to as many users as possible. Google App Engine makes this easy, even for mere mortals. Once you have accepted that nothing is really free, and that you do have to pay for the bandwidth and energy used at some point, scaling this application up to 100K or even 1M users requires little or no effort on your part. You also, at the moment, have to accept that if you are going to reach that many students, you are going to have to ask someone for a little bit of help to write some HTML, drive a spreadsheet and write a bit of JavaScript, as well as hit the &#8220;deploy&#8221; button on the App Engine SDK. I say at the moment because it isn&#8217;t going to be that hard to create an administrative UI, and that&#8217;s what I have been doing for a few hours this week.</p>
<p>So the reality is, very few lecturers are going to create a course that will be delivered to 155K students, and if they succeed in going viral, the drop out rate is likely to be very high. The course Google ran issued 22K certificates, indicating a drop out rate of around 85%. It&#8217;s still an impressive number when many campuses are nowhere near that size. However, most institutions would not survive with that level of drop out, and all would be looking at ways of reducing it. Institutions invest more in their students and so need lower levels of drop out. As a result, their courses are smaller, they don&#8217;t have the economies of scale and can&#8217;t invest as much in the delivery of each individual course. All is not lost however: the opportunity that Google&#8217;s CourseBuilder represents could be utilized if there was a small reduction in the effort associated with course creation and course delivery.</p>
<p>The video attached to this blog post shows how that might be achieved. This is a modified version of Google CourseBuilder that allows a single Google App Engine application to host more than one course. It could easily host a course catalogue from a small institution or medium sized faculty. That course catalogue is uploaded via a spreadsheet. Individual courses containing units and lessons are also uploaded via separate spreadsheets.</p>
<p>Students sign in using their <a class="zem_slink" title="Google Account" href="https://accounts.google.com/" rel="homepage" target="_blank">Google ID</a>, <a class="zem_slink" title="List of Google products" href="http://www.google.com" rel="homepage" target="_blank">Google Apps for Education</a> ID, or <a class="zem_slink" title="OpenID Foundation" href="http://openid.net" rel="homepage" target="_blank">OpenID</a>. They then register for the courses they want to take. If you want to give it a try there is an App Engine instance running at <a href="http://cbmultidemo.appspot.com/">http://cbmultidemo.appspot.com/</a>; bear in mind it&#8217;s a free instance so may become unavailable.</p>
<p>At the moment the administrative interface is very basic, but the intention is to build that up to allow courses to be created without needing to resort to technical resources. So far I have spent about 4 hours eliminating most of the code base editing and adding multi course capability. The code base is available as a fork of the Google CourseBuilder project and can be deployed by anyone with a Google ID. Since the original code was written in Python using a modern variant of the GAE framework, porting to Django would be trivial for those who have concerns about running on Google infrastructure. Obviously in doing so, you will have to work out how to do the scaling; see Instagram for pointers on that.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/09/15/google-coursebuilder-a-scalable-course-delivery-platform/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://ianboston.files.wordpress.com/2012/09/googlecoursebuildermulti-computer-m4v.jpg?w=300" medium="image">
			<media:title type="html">GoogleCourseBuilderMulti - Computer.m4v</media:title>
		</media:content>
	</item>
	</channel>
</rss>
