<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Timefields</title>
	<atom:link href="http://blog.tfd.co.uk/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.tfd.co.uk</link>
	<description>Open Source Open Thought</description>
	<lastBuildDate>Fri, 10 Feb 2012 00:57:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='blog.tfd.co.uk' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Timefields</title>
		<link>http://blog.tfd.co.uk</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://blog.tfd.co.uk/osd.xml" title="Timefields" />
	<atom:link rel='hub' href='http://blog.tfd.co.uk/?pushpress=hub'/>
		<item>
		<title>Access Control Lists in Solr/Lucene</title>
		<link>http://blog.tfd.co.uk/2012/02/08/access-control-lists-in-solrlucene/</link>
		<comments>http://blog.tfd.co.uk/2012/02/08/access-control-lists-in-solrlucene/#comments</comments>
		<pubDate>Wed, 08 Feb 2012 05:09:36 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[ApacheSolr]]></category>
		<category><![CDATA[Information Retrieval]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=567</guid>
		<description><![CDATA[This isn&#8217;t so much about access control lists in Solr or Lucene but more about access control lists in an inverted index in general. The problem is as follows. We have a large set of data that is access controlled. The access control is managed by users and they can make individual items closed or open [...]]]></description>
			<content:encoded><![CDATA[<p>This isn&#8217;t so much about access control lists in Solr or Lucene as about access control lists in an <a class="zem_slink" title="Inverted index" href="http://en.wikipedia.org/wiki/Inverted_index" rel="wikipedia">inverted index</a> in general. The problem is as follows. We have a large set of data that is access controlled. The access control is managed by users, who can make individual items closed, open or anywhere in between. The access control list on each content item, which may be a file or simply a bundle of metadata, takes the form of pairs of bitmaps representing the permissions granted and denied. Each pair of bitmaps is associated with a principal, and the set of principal/bitmap pairs is associated with the content item. A side complication is that the content is organised hierarchically, and permissions for any one user are inherited by following the hierarchy back to the root of the tree. Users have many principals: through membership of groups, through directly granted static principals and through dynamically acquired principals. All of this is implemented outside Solr, in a content system. It&#8217;s Solr&#8217;s task to index the content in such a way that a query for an item is efficient and returns a dense result set, from which the one or two content items that the user can&#8217;t read can be filtered out before the user gets to see the list. That is, we can tolerate a few items the user can&#8217;t read being found by a Solr query, but we can&#8217;t tolerate most being unreadable. In the ACL bitmaps, we are only interested in the read permission.</p>
<p>The approach I have taken to date is to look at each content item or set of metadata when it&#8217;s updated, calculate the set of all principals that can read the item and add those principals as a multivalued keyword property of the Solr document. A query performed by a user computes the principals that the user has at the time of the query, and builds a Solr query that gets any document matching the query with a reading principal in that set. Where the use of principals is moderate, this works well and does not overload either the <a class="zem_slink" title="Cardinality" href="http://en.wikipedia.org/wiki/Cardinality" rel="wikipedia">cardinality</a> of the inverted index where the reader principals are stored in Solr or the size of the Solr query. In these cases the query can be serviced as any other low cardinality query would be, by looking up and accumulating the bitmap representing the full set of documents for each reader principal in turn. Resolving the permissions part of the query then requires n lookup and accumulate operations, where n is the number of principals the user has.</p>
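<p>A minimal sketch of the reader principal approach described above, written in Python rather than against Solr itself. The field name <code>readers</code>, the principal strings and the function names are assumptions made for illustration, not taken from the real system.</p>

```python
from collections import defaultdict


def index_readers(docs):
    """Build a toy inverted index: reader principal -> set of document ids."""
    postings = defaultdict(set)
    for doc_id, readers in docs.items():
        for principal in readers:
            postings[principal].add(doc_id)
    return postings


def readable_docs(postings, user_principals):
    """Accumulate the postings of each principal the user holds (n lookups for n principals)."""
    result = set()
    for principal in user_principals:
        result |= postings.get(principal, set())
    return result


def solr_filter_query(user_principals, field="readers"):
    """Build the equivalent OR filter string; 'readers' is a hypothetical field name."""
    terms = " OR ".join('"%s"' % p for p in sorted(user_principals))
    return "%s:(%s)" % (field, terms)


# Illustrative data only.
docs = {
    "doc1": {"group:staff", "user:ian"},
    "doc2": {"group:students"},
    "doc3": {"user:alice"},
}
postings = index_readers(docs)
print(sorted(readable_docs(postings, {"user:ian", "group:students"})))
print(solr_filter_query({"user:ian", "group:students"}))
```

<p>The length of the generated filter grows linearly with the number of principals the user holds, which is the limit discussed next.</p>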
<p>However, and this is the reason for this post, this fails where the cardinality of the reader principals becomes too high, or the number of principals that a user has is too high. Unfortunately those two metrics are connected. The more principals there are in a system, the more a user will need in order to access information, and so the reader principal mechanism can begin to break down. The alternative is just as unpleasant, where the user only has a single principal, their own. In those scenarios active management of ACLs in the content system becomes unscalable both in compute and human terms, which is why principals representing groups were introduced in the first place. Even if there were no limits to the size of a Solr query, the cost of processing 1024 terms is prohibitive for anything other than offline processing.</p>
<div class="wp-caption alignright" style="width: 310px"><a href="http://commons.wikipedia.org/wiki/File:Bloom_filter.svg"><img class="zemanta-img-inserted zemanta-img-configured" title="Example of a Bloom filter" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Bloom_filter.svg/300px-Bloom_filter.svg.png" alt="Example of a Bloom filter" width="300" height="108" /></a><p class="wp-caption-text">Image via Wikipedia</p></div>
<p>One solution that has been suggested is to use a <a class="zem_slink" title="Bloom filter" href="http://en.wikipedia.org/wiki/Bloom_filter" rel="wikipedia">Bloom filter</a> to represent the set of principals that the user has and test each indexed principal against this filter. If this is done as part of the query, as the result set is being created, there is no gain over scanning all documents since the inverted index would not be used. There could be some benefit in using this approach once a potential set of documents has been generated and sorted, since the cost of performing sufficient hashes to fill the appropriate set of Bloom buckets is low enough that it could be used as a post query filter. I think the term in Solr is a Collector. In this scenario we are already verifying that the user can read a content item or its metadata before delivering it to the user, hence it&#8217;s acceptable to have a less than perfect set of pointers emitted from Solr, provided that the set of pointers we retrieve is dense. We can&#8217;t afford to have a situation where the set of pointers is sparse, say several million items, and the only item the user can read is the last one. In that scenario any permissions checking performed without the benefit of Solr would be super expensive.</p>
<p>So, the Bloom filter applied within Solr has the potential to filter most result sets rapidly enough to create a result set that is dense enough for more thorough checking. How dense does it need to be and how large does the Bloom filter need to be? That is an open-ended question. However, if on average you had to read 20% more elements than you returned, that might not be excessive if result sets were generally limited to no more than the first 100 items. If that&#8217;s the case then 80% density is enough. Bloom provides a guarantee of no false negatives but a probability of some percentage of false positives, ie items that a Bloom filter indicates are present in a set but which are not. For classical Bloom filters 20% is an extremely high probability of false positives, certainly not acceptable for most applications. It has been reported in a number of places that the quality of the <a class="zem_slink" title="Hash function" href="http://en.wikipedia.org/wiki/Hash_function" rel="wikipedia">hash function</a> used to fill the buckets of the filter also affects the number of false positives, since an uneven distribution of hashes over the number space represented by the Bloom bitmap will result in inefficient usage of that bitmap. Rather than doing the math, which you will find on <a href="http://en.wikipedia.org/wiki/Bloom_filter">Wikipedia</a>, knowing that all fast hashes are less than perfect, and being a pragmatist, I did an experiment. In Apache Hadoop there are several implementations of the Bloom filter with 2 evenly distributed and efficient hash functions, <a href="http://en.wikipedia.org/wiki/Jenkins_hash_function">Jenkins</a> and <a href="http://en.wikipedia.org/wiki/MurmurHash">Murmur2</a>, so I have used that implementation. What I am interested in is how big a filter I would need to get 80% density in a set of results, and how big that bitmap would need to be as the number of inputs to the Bloom filter (the user&#8217;s principals) rises.</p>
<p>It turns out that very small bitmap lengths will give sufficient density where the number of input terms is small, even if the number of tested principal readers is high. So 32 bytes of Bloom filter is plenty large enough to test with &lt; 20 principals. Unfortunately, however, the cardinality of these bitmaps is too high to be a keyword in an inverted index. For example, if the system contained 256K principals, and we expected users on average to have no more than 64 principals, we would need a Bloom filter of no more than 256 bits to generate, on average, 80% density. Since we are not attempting to index that Bloom filter the cardinality of 2^256 is not an issue. Had we tried to, we would almost certainly have generated an unusable inverted index. Also, since that Bloom filter is constructed for each user&#8217;s query, we can dynamically scale it to suit the conditions at the time of the query (number of items in the system, and number of principals the user has). Real systems with real users have more principals, and sometimes users with many more principals. A system with 1M principals that has on average 1024 principals per user will need a Bloom filter containing about 8Kbits. It&#8217;s certain that adding an 8Kbit token (or a 1Kbyte[]) as a single parameter to a Solr query circumvents the issue surrounding the number of terms we had previously, but it&#8217;s absolutely clear that a cardinality of 2^8192 is going to be well beyond indexing, which means that the only way this will work is to post filter a potentially sparse set of results. That does avoid rebuilding the index.</p>
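<p>A minimal sketch of the post filter idea above, in Python. It is not the Hadoop implementation used in the experiment: the hashing (double hashing over a SHA-256 digest) stands in for Jenkins or Murmur2, and the class, function and principal names are assumptions for illustration. The estimate function is the textbook false positive formula for a classical Bloom filter, per tested principal.</p>

```python
import hashlib
import math


class BloomFilter:
    """Classical Bloom filter over strings, built per query from the user's principals."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Double hashing over one SHA-256 digest; an assumption, not Jenkins/Murmur2.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True for every added item (no false negatives); sometimes True for others.
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(num_bits, num_items, num_hashes):
    """Textbook estimate (1 - e^(-k*n/m))^k for m bits, n items and k hashes."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


def optimal_hashes(num_bits, num_items):
    """Number of hashes that minimises the estimate above: (m/n) * ln 2, rounded."""
    return max(1, round(num_bits / num_items * math.log(2)))


def post_filter(candidates, user_filter):
    """Keep documents having any reader principal that may be in the user's set."""
    return [doc_id for doc_id, readers in candidates
            if any(user_filter.might_contain(r) for r in readers)]


# Illustrative sizing: 64 principals in a 256 bit filter.
k = optimal_hashes(256, 64)
user_filter = BloomFilter(256, k)
for n in range(64):
    user_filter.add("principal:%d" % n)
print(k, false_positive_rate(256, 64, k))
print(post_filter([("doc1", ["principal:3"]), ("doc2", ["principal:999"])], user_filter))
```

<p>Anything that survives the post filter still has to go through the full permission check in the content system, since a false positive only means the item might be readable.</p>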
<p>From this small experiment I have some questions unanswered:</p>
<ul>
<li>Will converting a potentially sparse set of results into a dense one be quick enough, or will it just expose another DoS vector?</li>
<li>What will be the practical cost of performing 100 (principals) x 20 (I was using 20 hashes) hash operations into an 8Kbit filter to filter out each returned doc item?</li>
<li>Will the processing of queries this way present a DoS vector?</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/02/08/access-control-lists-in-solrlucene/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Bloom_filter.svg/300px-Bloom_filter.svg.png" medium="image">
			<media:title type="html">Example of a Bloom filter</media:title>
		</media:content>
	</item>
		<item>
		<title>Deprecate Solr Bundle</title>
		<link>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/</link>
		<comments>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/#comments</comments>
		<pubDate>Thu, 02 Feb 2012 23:18:33 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Solr]]></category>
		<category><![CDATA[content repository]]></category>
		<category><![CDATA[Enterprise Content Management]]></category>
		<category><![CDATA[index updates]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[repository system]]></category>
		<category><![CDATA[social content]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://ianboston.wordpress.com/?p=556</guid>
		<description><![CDATA[Before that scares the hell out of anyone using Solr, the Solr bundle I am talking about is a small shim OSGi bundle that takes content from a Social Content Repository system called Sparse Map and indexes the content using Solr, either embedded or as a remote Solr cluster. The Solr used is a snapshot from [...]]]></description>
			<content:encoded><![CDATA[<p>Before that scares the hell out of anyone using Solr, the <a href="https://github.com/ieb/solr/">Solr bundle</a> I am talking about is a small shim OSGi bundle that takes content from a Social Content Repository system called <a href="https://github.com/ieb/sparsemapcontent">Sparse Map</a> and indexes the content using Solr, either embedded or as a remote Solr cluster. The Solr used is a snapshot from the current 4.0 development branch of Solr. Right, now that&#8217;s cleared up, I suspect 90% of the readers will leave the page to go and read something else?</p>
<p>So Solr 4 works just great. The applications using Sparse Map, like Sakai OAE, have a high update rate and are adding to the index continuously. The bundle queues updates and processes them via a single threaded queue reader into the index, which is configured to accept soft commits and perform periodic flushes to disk. The Solr instance is unmodified from the standard Solr 4 snapshot and we have had no problems with it. Provided the cardinality of the fields that the application indexes is not insane, and the queries are also achievable, there are no performance issues, with queries taking the sub 10ms that we have all become accustomed to from Solr. Obviously if you do stupid things you can make a query in Solr take seconds.</p>
<p>There are however some issues with the way the bundle works, and certainly when deployed into production in a real cluster there are issues. No one would seriously run Sparse Map with this Solr bundle on a single app server node for anything other than development or testing, so the default Embedded Solr configuration is a distraction. If you&#8217;re not writing code with the intention of deploying into production, then why write the code? Life is too short, unless you&#8217;re an academic on track to a Nobel prize. When deployed, the bundle connects to a remote Solr master for indexing, with one or more Solr slaves hanging off the master (polling, not being pushed to). There are several problems with this configuration. If the master goes down, no index updates can happen. This doesn&#8217;t break the Solr bundle since it queues and recovers from master failure with a local write ahead transaction log or queue. It does break the indexed data on the master since anything in memory on the master will be lost, and only those segments on disk will get propagated to the Solr slaves when the master recovers. This is a rock and a hard place. 1s commits with propagation cause unsustainable segment traffic with high segment merge activity. Infrequent commits will just lose data and destroy data propagation rates. The slaves, being read only, are expendable provided there are always enough to service the load. That sounds like the definition of a slave. I would not like to be one, but then I wouldn&#8217;t know if I was.</p>
<p>Solr, in this configuration, wasn&#8217;t really designed for this type of load. If we were indexing new documents at the rate of 1 batch an hour then Solr in this configuration would be perfect. However the updates can come through at thousands per second. So when it works, it&#8217;s fine, but when it breaks it will break and leave the index in some unknown state. The problem is rooted in how the indexing is done and where the write ahead log or queue is stored. It&#8217;s fine for a single instance since the write ahead log is local to the embedded Solr instance, but no good for a cluster.</p>
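<p>A minimal sketch of the queue behaviour described above, in Python: a single reader drains a local queue in batches and only removes updates once the remote index has accepted them. The class and callback names are hypothetical, this is not the bundle&#8217;s code, and the queue here lives in memory whereas a real write ahead log would be persisted to disk.</p>

```python
import itertools
from collections import deque


class IndexUnavailable(Exception):
    """Raised by the sender when the remote index (the master) cannot be reached."""


class WriteAheadQueue:
    def __init__(self, send_batch, batch_size=100):
        self.pending = deque()        # updates not yet acknowledged by the index
        self.send_batch = send_batch  # callable that pushes one batch to the index
        self.batch_size = batch_size

    def add(self, update):
        self.pending.append(update)

    def drain(self):
        """Single reader: send batches in order and keep them queued on failure."""
        sent = 0
        while self.pending:
            batch = list(itertools.islice(self.pending, self.batch_size))
            try:
                self.send_batch(batch)
            except IndexUnavailable:
                break  # leave everything queued and retry on the next drain
            for _ in batch:
                self.pending.popleft()
            sent += len(batch)
        return sent


# Illustrative usage with a sender that just records what it was given.
received = []
queue = WriteAheadQueue(received.extend, batch_size=2)
for doc_id in ["a", "b", "c"]:
    queue.add(doc_id)
print(queue.drain(), received)
```

<p>This protects the updates held by the app server, but not the updates the master has accepted into memory and not yet flushed, which is the gap described above.</p>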
<h2>Other approaches</h2>
<p>There are lots of ways to solve this problem. It was solved in Sakai 2 (CLE) search, which treated segments as immutable and sent them to a central location for distribution to each app server. Writers on each app server wrote to local indexes, and on commit the segment was pushed to a central location from where it was pushed to all other app server nodes. The implementation was less than perfect and there were all sorts of timing issues, especially when it came to merging and optimising. That code was written in 2006 on a very old version of Lucene (1.9.1 IIRC). So old it didn&#8217;t have commit, let alone soft commits, and it was only used for relatively slow rates of update supporting non critical user functionality. It&#8217;s in production at many Sakai 2 schools. Every now and again a segment gets corrupted and that corruption propagates slowly over the whole cluster with each merge and optimise. Eventually full index rebuilds are needed, which can be carried out when in full production but are best done overnight when the levels of concurrency are lower.</p>
<p>At the time we had considered using the DB based IndexReaders and IndexWriters from the Compass project. These were readers and writers that used a DB BLOB as the backing store. Lucene depends heavily on seek performance, and doing seeks over a network into a DB blob doesn&#8217;t work. The IO required to retrieve sections of the segments to pull terms is so high that search speed is a bit low (British understatement, stiff upper lip and all that). After tests those drivers were rejected for the Sakai 2 work. It might have worked on an Oracle DB, where seeks in blobs are supported and you can do some local caching, but on MySQL it was a non-starter.</p>
<p>The next approach is that used by Jackrabbit. The Lucene index is embedded in the repo. Every repo has a local index, with updates being written directly to all indexes, synchronised across the cluster. It works well on one app node, but suffers in a cluster since every modification to the local index has to be serialised over the entire cluster. Depending on the implementation of that synchronisation, it can make the whole cluster serialised on update. That&#8217;s ok if the use case is mostly read, as it is with the Enterprise Content Management use case, but in a Social Content Repository the use case has a much higher rate of update. App servers can&#8217;t wait in a queue to get a lock on the virtual cluster wide index before making their insert and inserting a pointer into a list to tell all the others they&#8217;re done.</p>
<p>Since 2006 the world has not stood still and there have been lots of people looking at this space. LinkedIn open sourced <a href="http://javasoze.github.com/zoie/">Zoie</a> and <a href="http://sna-projects.com/bobo/">Bobo</a>, which deliver batched updates into distributed indexes and then build faceted search from those indexes. Although these would work for a Social Content Repository, my feeling was that the quality of data service (the time it takes from a content item update to its presence in the index) was too high, and required lots of discipline in the coding of the application to ensure that data local to the user was published directly to the content system rather than discovered via the search index. The area of immediate impact of data for LinkedIn is well defined, the user&#8217;s view of their profile etc, so that QoDS can be higher than where an update might have to instantly propagate to 100s of users. The types of use cases I was targeting with Sparse Map were more like Google+, where groups take a greater prominence. Except that in Education the group interaction is real time, which pushed the QoDS down into the second or sub second range. So Zoie was ground breaking, but not enough. The work on this application, now Sakai OAE, started in 2008 when there was nothing else (visible) around. We started with Sling based on Jackrabbit and used its search capabilities, until we realised that a Social Content Repository has to support wide shallow hierarchies with high levels of concurrent update, whereas the Enterprise Content Management model is deep narrow hierarchies with lower levels of concurrent update. See <a href="www.slideshare.net/ianeboston/sparse-content-map-storage-system">this</a> for detail.</p>
<p>Roll forwards to 2010, when we pulled in Solr 4, which was just about to get the NRT patches applied. It looked, bar the small issue of cluster reliability, like an OK choice. And now we&#8217;re up to date, in 2012, the world of distributed search has moved on and I want to solve the major blocker of reliability. I don&#8217;t want to have to write a distributed index as I did for Sakai 2, partly because there are many others out there doing the same thing better than I have time to. I could use SolrCloud, although IIUC that deals with the cloud deployment of shards of Solr slaves rather than addressing the reliability of high volume updates to those shards.</p>
<h2>Terms, Documents or Segments</h2>
<p>What to shard and replicate? The ability to shard will ensure scalability in the index, which turns the throughput model from a task compute farm into a parallel machine using the simplest of gather scatter algorithms (my PhD and early research was on numerical parallel solutions on early MPP hardware; we always looked down on gather scatter since it never worked for highly interconnected and dynamic problem sets. Sorry if that&#8217;s offensive to MapReduce aficionados. BTW, gather scatter is the right algorithm here). The ability to replicate, many times, will ensure that we don&#8217;t have to think about making hardware resilient. But what to shard and replicate? The Compass IndexReader and IndexWriter DB implementation proved that inverted indexes need high seek speeds to minimise the cost of scanning segments for terms. Putting latency between the workings of the inverted index and its storage was always going to slow an index down, and even if you made segments and terms local to processing, processing queries on partial documents (shards of terms) creates imbalance in the processing load of a parallel machine and dependence on the queries. The reason for less than perfect parallel speedup on numerical problems in 1990 was almost always imperfect load balance in the algorithm. Pausing the solution for a moment to wait for other machines to finish is a simple bottleneck. Even if sharding and replication of partial documents or terms balances over the cluster of search machines, the IO to perform anything but the simplest query is probably going to dominate.</p>
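<p>A minimal sketch of sharding by document with a scatter and gather query, in Python. Each document lives whole on one shard, every shard answers the query against its own complete documents, and the partial results are merged. The routing by CRC32 of the document id, the term count scoring and all the names are assumptions for illustration; replication is left out.</p>

```python
import heapq
import zlib


class Shard:
    """One shard holds whole documents, so it can answer a query on its own."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, text):
        self.docs[doc_id] = text.lower().split()

    def search(self, term, limit):
        # Score is a plain term count; a real engine would rank properly.
        hits = [(words.count(term), doc_id)
                for doc_id, words in self.docs.items() if term in words]
        return heapq.nlargest(limit, hits)


class ShardedIndex:
    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def _route(self, doc_id):
        # Stable routing: the same document id always lands on the same shard.
        return self.shards[zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def index(self, doc_id, text):
        self._route(doc_id).index(doc_id, text)

    def search(self, term, limit=10):
        # Scatter the query to every shard, then gather and merge the top hits.
        partial = [hit for shard in self.shards for hit in shard.search(term, limit)]
        return [doc_id for _, doc_id in heapq.nlargest(limit, partial)]


# Illustrative usage.
idx = ShardedIndex(num_shards=3)
idx.index("a", "solr solr lucene")
idx.index("b", "solr")
idx.index("c", "cassandra")
print(idx.search("solr"))
```

<p>Because each shard only returns its own top hits, the gather step moves a small amount of data per query, unlike term based sharding where postings have to cross the network.</p>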
<p>So I need an index implementation that shards and replicates documents. It&#8217;s 2012 and a lot has happened. The author of Compass, Shay Banon (@kimchy), went on to write <a href="http://www.elasticsearch.org/">ElasticSearch</a> with a bunch of other veterans. It looks stable and has considerable uptake, with drivers for most languages. It abandons the store segments centrally model of Compass and Sakai 2 and replicates the indexing operation so that documents are sharded and replicated. Transporting a segment over the network after a merge operation, as Solr Master/Slave does, is time consuming, especially if you have everything in a single core and your merged segment set has become many GB in size. ElasticSearch looks like a prime contender for replacing the search capability since it&#8217;s simple to run, self configuring and discovering, and ticks all the boxes as far as scaling, reliability and ease of use go.</p>
<p>Another contender is <a href="http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/">Lucandra</a>. Initially this was just Lucene on top of Cassandra. It implemented the IndexReader and IndexWriter inside Cassandra without segments, eliminating the need to manage segments but also losing most of the optimisations of memory mapped data. Unlike the Compass IndexReader and IndexWriter, which wrote segments to DB blobs, the structure of the index is columns and rows inside Cassandra. This is not dissimilar to the Sparse Map Cassandra driver, which indexes by writing its own inverted index as it goes. There are some performance gains since if you put the Lucandra class into the Cassandra JVM the data is supposedly local. However, Cassandra data is replicated and sharded so there is still significant IO between the nodes, and although the solution may benefit from Cassandra&#8217;s ability to cache, it will still suffer from the same problem that all term based or partial document sharding suffers from: poor performance due to IO. When Lucandra became <a href="http://blog.sematext.com/2011/09/09/the-state-of-solandra-summer-2011/">Solandra</a> a year later, the authors reported the performance issues, but also reported a switch to sharding by document.</p>
<p>There will be more out there, but these examples show that the approach taken by the closed source community, implementing large distributed indexes by sharding and replicating whole documents, is the right one to follow. (Hmm, isn&#8217;t that what the 1998 paper from some upstarts titled &#8220;<a href="http://infolab.stanford.edu/pub/papers/google.pdf">The Anatomy of a Large-Scale Hypertextual Web Search Engine</a>&#8221; said?) The authors of Solandra admit that it still loses many of the optimisations of the segment, but rightly point out that if you&#8217;re deploying infrastructure to manage millions of small independent indexes then file system storage becomes problematic, which is where the management of storage by Cassandra becomes an advantage. As of September 2011 I get the impression that ElasticSearch is more mature than Solandra, and although everyone itches these days to play with a new tool in production (like a column DB) and throw away the old and reliable file system, I am not convinced that I want to move just yet. Old and reliable is good; sexy and new always gets me into trouble.</p>
<p>I think I am going to deprecate the Solr bundle used for indexing content in Sparse Map and write a new bundle targeting ElasticSearch. It will be simpler, since I can use the write ahead transaction log already inside ElasticSearch, and it&#8217;s already real time (1s latency to commits and faster than that for non flushed indexes). I have also found references to it supporting bitmap <a href="http://en.wikipedia.org/wiki/Bloom_filter">bloom</a> filter fields, which means I can now embed much larger scale ACL reader indexing within the index itself. A post to follow on that later. Watch this space.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>Rogue Gadgets</title>
		<link>http://blog.tfd.co.uk/2012/01/05/rogue-gadgets/</link>
		<comments>http://blog.tfd.co.uk/2012/01/05/rogue-gadgets/#comments</comments>
		<pubDate>Thu, 05 Jan 2012 00:20:24 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Gadget]]></category>
		<category><![CDATA[OCLC]]></category>
		<category><![CDATA[OpenSocial]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=552</guid>
		<description><![CDATA[I have long thought one of the problems with OpenSocial is its openness to enable any Gadget based app anywhere. Even if there is a technical solution to the problem of a rogue App in the browser sandbox afforded by the iframe that simply defers the issue. Sure, the Gadget code that is the App, [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=552&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I have long thought one of the problems with OpenSocial is its openness to enable any Gadget based app anywhere. Even if there is a technical solution to the problem of a rogue App, in the browser sandbox afforded by the iframe, that simply defers the issue. Sure, the Gadget code that is the App can&#8217;t escape the iframe sandbox and interfere with the in-browser container or other iframe hosted apps from the same source. Unfortunately wonderful technical solutions are of little interest to a user whose experience is impacted by the safe but rogue app. The app may be technically well behaved, but downright offensive and inappropriate on many other levels, and this is an area which has given many institutions food for thought when considering gadget based platforms like Google Apps for Education. A survey of the gadgets that a user could deploy via an open gadget rendering endpoint reveals that many violate internal policies of most organizations. Racial and sexual equality are often compromised. Even basic decency. It&#8217;s the openness of the gadget renderer that causes the problem: in many cases, when deployed, it will render anything it&#8217;s given. It&#8217;s not hard to find gadgets providing porn in gmodules, the source of iGoogle, not exactly what an institution would want to endorse on its staff/student home pages.</p>
<p>For too long there has been an assumption that it&#8217;s the responsibility of the user to self police. That&#8217;s fine where the environment is offered by an organisation that can claim to be &#8220;only the messenger&#8221;, but when an environment is offered by an organization that is more than a messenger, self policing doesn&#8217;t hold water. The weakness of the OpenSocial gadget environment is its openness. It&#8217;s hard, if not impossible, to control what gadgets are available and put the onus on the container to control what is loaded.</p>
<h2>Trusting Mobile Apps</h2>
<p>There is a parallel to this problem in the mobile device industry, seen in the difference between Android and iOS. Android is open: the environment allows developers to do almost anything they like and have full access to all features of the phone. The Android Market, with over 400K apps on it, is often reported as being <a href="http://www.techradar.com/news/phone-and-communications/mobile-phones/400000-apps-now-available-on-android-market-1051674">&#8220;wild west&#8221;</a>, to quote: &#8220;&#8230;Unlike Apple&#8217;s strict approval policy, the Android Market is seen a little like the Wild West of the mobile, with many applications getting through which would never make the cut on iOS&#8230;&#8221;. That leaves the user with plenty of choice but exposed to a lot of risk. It&#8217;s spawning an industry of FUD, based on real fears and dangers, generating a new revenue stream for those that profited from virus and malware explosions on PCs. This time it&#8217;s a mobile device where the user may have placed far more trust in the device than they know (money, bank details, authentication, liability), and has far less ability to do anything about it (there I go, adding to the FUD).</p>
<p>Don&#8217;t get me wrong, as a developer, I don&#8217;t like the iOS approval process, but I think it&#8217;s a necessary evil to ensure that those providing the market place or store know that what they are pushing onto the unsuspecting public won&#8217;t do harm. Firstly the iOS platform protects the device from the rogue developer. Secondly the approval process ensures that the app conforms to the guidelines, not eating the battery or using up all the user&#8217;s monthly bandwidth allowance in a day. Thirdly, although not always the case, the approval process ensures that the soft factors of the app are acceptable. I haven&#8217;t tried, but I suspect an app that worked as a terrorist bomb trigger, and gave step by step instructions on how to do it, would not pass the soft factors inspection. Consequently users of the iOS platform feel that they can trust the apps they are being sold. There is no aftermarket industry in end-user protection as there is no business case to support it.</p>
<p>In the Gadget environment, it&#8217;s the gadget renderer that is the equivalent to the store. By rendering a gadget, the renderer is not just a &#8220;messenger&#8221; not to be blamed, it&#8217;s saying something about what it&#8217;s rendering. If the gadget renderer doesn&#8217;t do that, then I have to argue that you should not trust the gadget renderer. It could be pushing anything at you. You might trust it, but if it doesn&#8217;t trust what it&#8217;s sending you, how can you trust what it sends? Would you accept a package from a person in a uniform before boarding a plane, just because the uniform had a badge with the word &#8220;security&#8221; on it? No, neither would I. If they had a gun and ID, I would still ask them why I should be trusted to carry it.</p>
<h2>OCLC WorldShare</h2>
<p>There are some OpenSocial gadget renderers that care about their reputation. Most Libraries are considered to be trusted sources of information and <a href="http://en.wikipedia.org/wiki/Online_Computer_Library_Center">OCLC</a>, with a membership of 72000 libraries, museums and archives in 170 countries, has a reputation it and its membership care about. OCLC recently launched <a href="http://www.oclc.org/us/en/worldshare-platform/howitworks/default.htm">WorldShare</a>, an OpenSocial based platform that uses Apache Shindig to render Gadgets and provide access for those gadgets to a wealth of additional information feeds. It does not provide the container in which to mount the Gadgets but it provides a trusted and respected source of rendered Gadgets. This turns the OpenSocial model on its head. A not for profit organisation delivering access to vast stores of information via OpenSocial and the Gadget feeds. Suddenly the gadget rendered feed is the only thing that matters. The container could be provided by OCLC, but equally by members. OCLC has wisely decided to <a href="http://www.oclc.org/developer/platform/certification">certify</a> any gadget that it is prepared to serve. Like the iOS certification and approval process, WorldShare&#8217;s certification is based on technical and soft criteria. That process will hopefully ensure quality, add value and protect its users from the wild west. Just as we trust our libraries to truthfully hold and classify knowledge, I hope that WorldShare&#8217;s realisation that the vendor has a responsibility will give us all the confidence to continue to trust OCLC as a source.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ianboston.wordpress.com/552/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ianboston.wordpress.com/552/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/godelicious/ianboston.wordpress.com/552/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/delicious/ianboston.wordpress.com/552/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gofacebook/ianboston.wordpress.com/552/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/facebook/ianboston.wordpress.com/552/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gotwitter/ianboston.wordpress.com/552/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/twitter/ianboston.wordpress.com/552/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gostumble/ianboston.wordpress.com/552/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/stumble/ianboston.wordpress.com/552/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/godigg/ianboston.wordpress.com/552/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/digg/ianboston.wordpress.com/552/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/goreddit/ianboston.wordpress.com/552/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/reddit/ianboston.wordpress.com/552/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=552&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2012/01/05/rogue-gadgets/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>SparseMap Content version 1.4 released.</title>
		<link>http://blog.tfd.co.uk/2011/12/14/sparsemap-content-version-1-4-released/</link>
		<comments>http://blog.tfd.co.uk/2011/12/14/sparsemap-content-version-1-4-released/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 04:39:11 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=549</guid>
		<description><![CDATA[Sparse Map version 1.4 has been tagged (org.sakaiproject.nakamura.core-1.4) and released. Downloads of the source tree in Zip and TarGZ form are available from GitHub. In this release 6 issues were addressed, the details are in the issue tracker.  The main difference you will notice in this release is the size of the core. The jar has shrunk from over 2MB to just [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=549&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Sparse Map version 1.4 has been tagged (<a href="https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.core-1.4">org.sakaiproject.nakamura.core-1.4</a>) and released. Downloads of the source tree in <a href="https://github.com/ieb/sparsemapcontent/zipball/org.sakaiproject.nakamura.core-1.4">Zip</a> and <a href="https://github.com/ieb/sparsemapcontent/tarball/org.sakaiproject.nakamura.core-1.4">TarGZ</a> form are available from GitHub.</p>
<p>In this release 6 <a href="https://github.com/ieb/sparsemapcontent/issues?sort=created&amp;direction=desc&amp;state=closed&amp;page=1&amp;milestone=5">issues</a> were addressed; the details are in the <a href="https://github.com/ieb/sparsemapcontent/issues">issue tracker</a>. The main difference you will notice in this release is the size of the core. The jar has shrunk from over 2MB to just over 200KB. This is due to the introduction of a Service Provider Interface for the raw storage layer. Implementations of the Service Provider Interfaces have been released separately for Derby, <a class="zem_slink" title="MySQL" href="http://en.wikipedia.org/wiki/MySQL" rel="wikipedia">MySQL</a> and <a class="zem_slink" title="PostgreSQL" href="http://en.wikipedia.org/wiki/PostgreSQL" rel="wikipedia">PostgreSQL</a>. Due to the licensing surrounding the Oracle JDBC driver I have not released a binary of the Oracle SPI implementation, however there is a tagged release in the source repository. I have also refrained from releasing the SPI implementations for Cassandra, <a class="zem_slink" title="HBase" href="http://en.wikipedia.org/wiki/HBase" rel="wikipedia">HBase</a> and <a class="zem_slink" title="MongoDB" href="http://en.wikipedia.org/wiki/MongoDB" rel="wikipedia">MongoDB</a> as I am not satisfied the implementations are sufficiently tested or complete.</p>
<p>If you find any issues, please mention them to me or, better still, add an issue to the issue tracker. Unless otherwise stated the license is Apache 2. Thanks to everyone who made this release possible.</p>
<pre>Tag: <a href="https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.core-1.4">https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.core-1.4</a>
Derby SPI Tag: <a href="https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.derby-driver-10.6.2.1-1.4">https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.derby-driver-10.6.2.1-1.4</a>
PostgreSQL SPI Tag: <a href="https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.postgres-driver-9.0-801-1.4">https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.postgres-driver-9.0-801-1.4</a>
MySQL SPI Tag: <a href="https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.mysql-driver-5.1.13-1.4">https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.mysql-driver-5.1.13-1.4</a>
Oracle SPI Tag: <a href="https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.oracle-driver-1.4-1.4">https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.oracle-driver-1.4-1.4</a>
Issues Fixed: <a href="https://github.com/ieb/sparsemapcontent/issues?sort=created&amp;direction=desc&amp;state=closed&amp;page=1&amp;milestone=5">https://github.com/ieb/sparsemapcontent/issues?sort=created&amp;direction=desc&amp;state=closed&amp;page=1&amp;milestone=5</a></pre>
<p>To use</p>
<pre>&lt;dependency&gt;
  &lt;groupId&gt;org.sakaiproject.nakamura&lt;/groupId&gt;
  &lt;artifactId&gt;org.sakaiproject.nakamura.core&lt;/artifactId&gt;
  &lt;version&gt;1.4&lt;/version&gt;    
&lt;/dependency&gt;</pre>
<p>The Jar is an <a class="zem_slink" title="OSGi" href="http://en.wikipedia.org/wiki/OSGi" rel="wikipedia">OSGi</a> bundle complete with Manifest, bundled with services. To use it you will need to select an SPI implementation fragment bundle and deploy that with the core bundle. Normally this is done when the OSGi Standalone application jar is constructed. In addition to the core SparseMap bundle you will now need one of the SPI implementation fragments.</p>
<p>Derby</p>
<pre>&lt;bundle&gt;
  &lt;groupId&gt;org.sakaiproject.nakamura&lt;/groupId&gt;
  &lt;artifactId&gt;org.sakaiproject.nakamura.derby-driver&lt;/artifactId&gt;
  &lt;version&gt;10.6.2.1-1.4&lt;/version&gt;    
&lt;/bundle&gt;</pre>
<div>
<p>PostgreSQL</p>
<pre>&lt;bundle&gt;
  &lt;groupId&gt;org.sakaiproject.nakamura&lt;/groupId&gt;
  &lt;artifactId&gt;org.sakaiproject.nakamura.postgres-driver&lt;/artifactId&gt;
  &lt;version&gt;9.0-801-1.4&lt;/version&gt;    
&lt;/bundle&gt;</pre>
<div>
<p>MySQL</p>
<pre>&lt;bundle&gt;
  &lt;groupId&gt;org.sakaiproject.nakamura&lt;/groupId&gt;
  &lt;artifactId&gt;org.sakaiproject.nakamura.mysql-driver&lt;/artifactId&gt;
  &lt;version&gt;5.1.13-1.4&lt;/version&gt;    
&lt;/bundle&gt;</pre>
<div></div>
</div>
</div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ianboston.wordpress.com/549/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ianboston.wordpress.com/549/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/godelicious/ianboston.wordpress.com/549/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/delicious/ianboston.wordpress.com/549/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gofacebook/ianboston.wordpress.com/549/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/facebook/ianboston.wordpress.com/549/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gotwitter/ianboston.wordpress.com/549/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/twitter/ianboston.wordpress.com/549/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gostumble/ianboston.wordpress.com/549/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/stumble/ianboston.wordpress.com/549/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/godigg/ianboston.wordpress.com/549/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/digg/ianboston.wordpress.com/549/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/goreddit/ianboston.wordpress.com/549/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/reddit/ianboston.wordpress.com/549/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=549&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2011/12/14/sparsemap-content-version-1-4-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>OSGi and SPI</title>
		<link>http://blog.tfd.co.uk/2011/12/13/osgi-and-spi/</link>
		<comments>http://blog.tfd.co.uk/2011/12/13/osgi-and-spi/#comments</comments>
		<pubDate>Tue, 13 Dec 2011 04:21:35 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Fragment Host]]></category>
		<category><![CDATA[OpenSocial]]></category>
		<category><![CDATA[osgi]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=544</guid>
		<description><![CDATA[OSGi provides a nice simple model to build components in and the classloader policies enable reasonably sophisticated isolation between packages and versions that make it possible to consider multiple versions of an API, and implementations of those APIs within a single container. Where OSGi starts to become unstuck is for SPI or Service Provider Interfaces. [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=544&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>OSGi provides a nice simple model to build components in, and the classloader policies enable reasonably sophisticated isolation between packages and versions that makes it possible to consider multiple versions of an API, and implementations of those APIs, within a single container. Where OSGi starts to come unstuck is with SPIs, or Service Provider Interfaces. It&#8217;s not so much the SPI that&#8217;s a problem, rather the implementation. SPIs normally allow a deployer to replace the internal implementation of some feature of a service. In Shindig there is an SPI for the various Social services that allows deployers to take Shindig&#8217;s implementation of OpenSocial and graft that implementation onto their existing Social graph. In other places the SPI might cover a lower level concept. Something as simple as storage. In almost all cases the SPI implementation needs some sort of access to the internals of the service that it is supporting, and that&#8217;s where the problem starts. In most of the models I have seen, OSGi bundles export packages that represent the APIs they provide. Those APIs provide a communications conduit to the internal implementation of the services that the API describes without exposing the implementation. That allows the developer of the API to stabilise the API whilst allowing the implementation to evolve. The OSGi classloader policy gives that developer some certainty that well-behaved clients (i.e. the ones that don&#8217;t circumvent the OSGi classloader policies) won&#8217;t be binding to the internals of the implementation.</p>
<p>SPIs, by contrast, are part of the internal implementation. Exposing an SPI as an export from a bundle is one approach, however it would allow any client to bind to the internal workings of the Service implementation, exposed as an API, and that would probably be a mistake. Normal, well-behaved clients could easily become clients of the SPI. That places additional, unwanted burdens on the SPI interface as it can no longer be fully trusted by the consumer of the SPI or its implementation.</p>
<p>A workable solution appears to be to use OSGi Fragment bundles that bind to a Fragment Host, the Service implementation bundle containing the SPI to be implemented. Fragment bundles are different in nature to normal bundles. It&#8217;s probably best to think of them as a jar that gets added to the classpath of the bundle identified as the Fragment Host on activation, so that the Fragment bundle&#8217;s contents become available to the Fragment Host&#8217;s classloader. Naturally there are some rules that need to be observed.</p>
<p>Unlike an OSGi bundle, a Fragment bundle can&#8217;t make any changes to imports and exports of the Fragment Host classloader. In fact if the manifest of the fragment contains any Import-Package or Export-Package statements, the Fragment will not be bound to the Fragment Host. The Fragment can&#8217;t perform activation and the fragment can&#8217;t provide classes in a package that already exists in the Fragment Host bundle, although it appears that a Fragment host can provide unique resources in the same package location. This combination of restrictions cuts off almost all the possible routes for extension, converting the OSGi bundle from something that can be activated into a simple jar on the classloader&#8217;s search path.</p>
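<p>As a rough sketch of that binding, a Fragment bundle built with the Felix Bundle plugin might declare its host as follows. The host symbolic name org.sakaiproject.nakamura.core is an assumption made for illustration, and whatever Import-Package or Export-Package headers the tool generates would still need to be checked against the restrictions described above.</p>
<pre>&lt;plugin&gt;
  &lt;groupId&gt;org.apache.felix&lt;/groupId&gt;
  &lt;artifactId&gt;maven-bundle-plugin&lt;/artifactId&gt;
  &lt;extensions&gt;true&lt;/extensions&gt;
  &lt;configuration&gt;
    &lt;instructions&gt;
      &lt;!-- Assumed symbolic name of the bundle that hosts the SPI; adjust to the real host. --&gt;
      &lt;Fragment-Host&gt;org.sakaiproject.nakamura.core&lt;/Fragment-Host&gt;
    &lt;/instructions&gt;
  &lt;/configuration&gt;
&lt;/plugin&gt;</pre>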
<p>There is one loophole that does appear to work. If the Fragment Host bundle specifies a Service-Component manifest entry that names a service component xml file that is not in the Fragment Host bundle, then that file can be provided by the Fragment bundle. If you are using the BND (or Felix Bundle plugin) tool to specify the Service-Component header, either explicitly or implicitly, you will find that your route is blocked. This tool checks that any file specified exists. If the file does not exist when the bundle is being built, BND refuses to generate the manifest. There may be some logic somewhere in that decision, but I haven&#8217;t found an official BND way of overriding the behaviour. The solution is to ask the BND tool to put an empty Service-Component manifest header in, then merge the manifest produced with some supplied headers when the jar is constructed. This allows you to build the bundle leveraging the analysis tools within BND and have a Service-Component header that contains non-existent service component xml files.</p>
<p>On startup, if there is no Fragment bundle adding the extra service component xml file to the Fragment Host classloader, then an error is logged and loading continues. If the Fragment bundle provides the extra service component xml file, then it&#8217;s loaded by the standard Declarative Service Manager that comes with OSGi. In that xml file, the implementor of the SPI can specify the internal services that implement the SPI, and allow the services inside the Fragment Host to satisfy their references from those components. This way, a relatively simple OSGi Fragment bundle can be used to provide an SPI implementation that has access to the full Fragment Host bundle internal packages, avoiding exposing those SPI interfaces to all bundles.</p>
<p>In SparseMap, I am using this mechanism to provide storage drivers for several RDBMSs via JDBC based drivers and a handful of Column DBs (Cassandra, HBase, MongoDB). The JDBC based drivers simply contain SQL and DDL configuration as well as a simple declarative service and the relevant JDBC driver jar. This is because the JDBC driver implementation is part of the Fragment Host bundle, where it lies inactive. The ColumnDB Fragment bundles all contain the relevant implementation and client libraries to make the driver work. SparseMap was beginning to be a dumping ground for every dependency under the sun. Formalising a storage SPI and extracting implementations into SPI Fragment bundles has made SparseMap storage independently extensible without having to expose the SPI to all bundles.</p>
<p>This will be in the 1.4 release of SparseMap due in a few days. Those using SparseMap will have to ensure that the SPI Fragment bundle is present in the OSGi container when the SparseMap Fragment Host bundle becomes active. If it&#8217;s not present, the repository in SparseMap will fail to start and an error will be logged indicating that OSGI-INF/serviceComponent.xml is missing.</p>
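<p>To make that concrete, the service component xml file supplied by the Fragment bundle could look something like the sketch below. The component name, implementation class and SPI interface are hypothetical names chosen for illustration; they are not the ones used in any real bundle.</p>
<pre>&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;!-- Hypothetical component: every name here is a placeholder for illustration. --&gt;
&lt;scr:component xmlns:scr="http://www.osgi.org/xmlns/scr/v1.1.0"
               name="com.example.storage.ExampleStorageDriver"&gt;
  &lt;!-- Class shipped inside the Fragment bundle that implements the SPI. --&gt;
  &lt;implementation class="com.example.storage.ExampleStorageDriver"/&gt;
  &lt;service&gt;
    &lt;!-- SPI interface defined inside the Fragment Host and not exported. --&gt;
    &lt;provide interface="com.example.storage.spi.StorageDriver"/&gt;
  &lt;/service&gt;
&lt;/scr:component&gt;</pre>
<p>Because the Fragment&#8217;s contents are loaded through the Fragment Host&#8217;s classloader, services inside the host can reference the provided interface without it ever being exported.</p>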
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ianboston.wordpress.com/544/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ianboston.wordpress.com/544/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/godelicious/ianboston.wordpress.com/544/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/delicious/ianboston.wordpress.com/544/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gofacebook/ianboston.wordpress.com/544/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/facebook/ianboston.wordpress.com/544/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gotwitter/ianboston.wordpress.com/544/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/twitter/ianboston.wordpress.com/544/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gostumble/ianboston.wordpress.com/544/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/stumble/ianboston.wordpress.com/544/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/godigg/ianboston.wordpress.com/544/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/digg/ianboston.wordpress.com/544/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/goreddit/ianboston.wordpress.com/544/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/reddit/ianboston.wordpress.com/544/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=544&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2011/12/13/osgi-and-spi/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>Solr Search Bundle 1.3 Released</title>
		<link>http://blog.tfd.co.uk/2011/12/13/solr-search-bundle-1-3-released/</link>
		<comments>http://blog.tfd.co.uk/2011/12/13/solr-search-bundle-1-3-released/#comments</comments>
		<pubDate>Tue, 13 Dec 2011 00:28:27 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=541</guid>
		<description><![CDATA[The Solr Search v1.3 bundle developed for Nakamura has been released. This is not to be confused with Apache Solr 4. The bundle wraps a snapshot version of Apache Solr 4 at revision 1162474 and exposes a number of OSGi components that allow a SolrJ client to interact with the Solr server. In this release 2 [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=541&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>The Solr Search v1.3 bundle developed for Nakamura has been released. This is not to be confused with Apache Solr 4. The bundle wraps a snapshot version of Apache Solr 4 at revision 1162474 and exposes a number of OSGi components that allow a SolrJ client to interact with the Solr server.</p>
<p>In this release 2 bugs were identified and fixed. These bugs relate to the reliability of remote servers in a Solr cluster and the reliability of the indexing queues.</p>
<p>As always, thanks go to everyone who contributed and helped to get this release out.</p>
<pre>Issues Fixed: https://github.com/ieb/solr/issues?sort=created&amp;direction=desc&amp;state=closed&amp;page=1&amp;milestone=4
Release Tag: https://github.com/ieb/solr/tree/org.sakaiproject.nakamura.solr-1.3</pre>
<p>Downloads are available from the release tag.</p>
<p>To use from a Maven 2 project:</p>
<pre> 
    &lt;dependency&gt;
        &lt;groupId&gt;org.sakaiproject.nakamura&lt;/groupId&gt;
        &lt;artifactId&gt;org.sakaiproject.nakamura.solr&lt;/artifactId&gt;
        &lt;version&gt;1.3&lt;/version&gt;
    &lt;/dependency&gt;</pre>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/ianboston.wordpress.com/541/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/ianboston.wordpress.com/541/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/godelicious/ianboston.wordpress.com/541/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/delicious/ianboston.wordpress.com/541/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gofacebook/ianboston.wordpress.com/541/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/facebook/ianboston.wordpress.com/541/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gotwitter/ianboston.wordpress.com/541/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/twitter/ianboston.wordpress.com/541/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gostumble/ianboston.wordpress.com/541/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/stumble/ianboston.wordpress.com/541/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/godigg/ianboston.wordpress.com/541/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/digg/ianboston.wordpress.com/541/" /></a> <a rel="nofollow" href="http://feeds.wordpress.com/1.0/goreddit/ianboston.wordpress.com/541/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/reddit/ianboston.wordpress.com/541/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=541&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2011/12/13/solr-search-bundle-1-3-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>Minimalism</title>
		<link>http://blog.tfd.co.uk/2011/11/25/minimalist/</link>
		<comments>http://blog.tfd.co.uk/2011/11/25/minimalist/#comments</comments>
		<pubDate>Fri, 25 Nov 2011 01:28:21 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Apache Felix]]></category>
		<category><![CDATA[Java API for RESTful Web Services]]></category>
		<category><![CDATA[JAX-RS]]></category>
		<category><![CDATA[osgi]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=539</guid>
		<description><![CDATA[In spare moments between real work, I&#8217;ve been experimenting with a light weight content server for user generated content. In short, that means content in a hierarchical tree that is shallow and very wide. It doesn&#8217;t preclude deep narrow trees, but wide and shallow is what it does best. Here are some of the things I wanted [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=blog.tfd.co.uk&amp;blog=6575768&amp;post=539&amp;subd=ianboston&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>In spare moments between real work, I&#8217;ve been experimenting with a lightweight content server for user-generated content. In short, that means content in a hierarchical tree that is shallow and very wide. It doesn&#8217;t preclude deep narrow trees, but wide and shallow is what it does best. Here are some of the things I wanted to do.</p>
<p>I wanted to support the same type of RESTful interface as seen in Sakai OAE&#8217;s Nakamura and in standards like Atom. By that I mean an interface where the URL points to a resource, and where the HTTP method, the HTTP protocol and markers in the URL determine what a request does. In short, along the lines of the arguments in <a href="http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven">http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven</a>, which probably drove the thinking behind <a class="zem_slink" title="Apache Sling" href="http://en.wikipedia.org/wiki/Apache_Sling" rel="wikipedia">Sling</a>, on which Nakamura is based. I mention Atom simply because the standard talks about the payload of a response but makes no mention of how the URL should be structured to get that payload, which reinforces the earlier desire.</p>
<p>I wanted the server to start as quickly as possible and use as little memory as possible, ideally &lt; 10s and &lt; 20MB. Java applications have a bad name for bloat, but there is no reason they have to be huge to serve load. Why so small (in Java terms)? Why not? Contrary to what most apps appear to assume, memory is not there to waste.</p>
<p>I wanted the core server to support some standard protocols, e.g. <a class="zem_slink" title="WebDAV" href="http://en.wikipedia.org/wiki/WebDAV" rel="wikipedia">WebDAV</a>, but I also wanted to make it easy to extend. Hence <a class="zem_slink" title="Java API for RESTful Web Services" href="http://en.wikipedia.org/wiki/Java_API_for_RESTful_Web_Services" rel="wikipedia">JAX-RS</a> (RestEasy) inside <a class="zem_slink" title="OSGi" href="http://en.wikipedia.org/wiki/OSGi" rel="wikipedia">OSGi</a> (a minimal Sling bootstrap + <a class="zem_slink" title="Apache Felix" href="http://en.wikipedia.org/wiki/Apache_Felix" rel="wikipedia">Apache Felix</a>).</p>
<p>I wanted the request processing to be efficient. That means streaming all requests (commons-fileupload 1.2.1 with streaming; no writing to an intermediate file or byte[], both of which involve high GC traffic and slow processing), processing everything only once, and making it available via an Adaptable pattern, a concept that is strong in Sling. It also means requests handled by response objects, not servlets. Why? So that the response state can be thread unsafe, and so that a request can be suspended in memory and unbound from the thread. Finally, the resolution that binds requests to resources to responses should be handled entirely in memory, by pointer, avoiding iteration. OK, the lookup of a resource might go through a cache, but the resolution through to the resource is an in-memory pointer operation.</p>
<p>Where content is static, I wanted to keep it static. Operating systems have file systems that are efficient at storing files, efficient at loading those files from disk, and efficient at avoiding disk access altogether. So if the bulk of the static files that my application needs really are static, why not use the filesystem? Many applications seem to confuse statically deterministic with dynamic. If all the possibilities can be computed at build time, and the resources required to create and serve them are not excessive, then the content is static. What&#8217;s excessive? A production build that takes 15 minutes to process all possibilities once a day is better than continually wasting heat and power doing it all the time. I might be a bit extreme in that view, accepting that filling a TB disk with compiled state is better than continually rebuilding that state incrementally in user-facing production requests. If a deployer wants to do something special (SAN, NAS, something cloud-like) with that filesystem, there are plenty of options. Httpd, Tomcat and Jetty are all capable of serving static files in the high thousands of requests per second, concurrently, so why not use that ability? Browser-based apps need every bit of speed they can get for static data.</p>
<p>The downside of all this minimalism is a server that doesn&#8217;t have lots of ways of doing the same thing. Unlike Nakamura, you can&#8217;t write JSPs or JRuby servlets. It barely uses the OSGi Event system and has none of the sophistication of Jackrabbit. The core container is Apache Felix with the Felix HttpService running a minimalist Jetty. The content system is Sparse Map Content, and the search component is Solr as an OSGi bundle. WebDAV is provided by Milton and JAX-RS by RestEasy. Caching is provided by <a class="zem_slink" title="Ehcache" href="http://en.wikipedia.org/wiki/Ehcache" rel="wikipedia">EhCache</a>. It starts in 8MB in 12s, and after load drops back to about 10MB.</p>
<p>Additional RESTful services are created in one of three ways.</p>
<ol>
<li>Registering a servlet with the Felix Http Service (whiteboard), which binds to a URL, breaking the desire that nothing should bind to fixed URLs.</li>
<li>Creating a component that provides a marker service, picked up by the OSGi extension to RestEasy that registers that service as a JAX-RS bean.</li>
<li>Creating a factory service that emits JAX-RS annotated classes that act as response objects. The factory is annotated with the type of requests it can deal with, and the response objects tell JAX-RS what they can do with the request. The annotations are discovered when the factory is registered with OSGi, and those annotations are compiled into a one step memory lookup. (single concurrent hashmap get)</li>
</ol>
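<p>The third approach can be sketched roughly as follows. This is a minimal illustration, not the server&#8217;s actual code: the names ResponseFactoryRegistry, ResponseFactory and Response are hypothetical, and the key format (HTTP method plus resource type) is an assumption standing in for whatever the real annotations describe. The point it shows is that all the work happens at registration time, so resolving a request is a single ConcurrentHashMap get.</p>

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry: names and key format are assumptions for illustration.
class ResponseFactoryRegistry {

    /** A response object: created per request, so it may safely hold per-request state. */
    public interface Response {
        String handle(String path);
    }

    /** A factory service that emits a fresh response object for each request. */
    public interface ResponseFactory {
        Response create();
    }

    // Populated when a factory is registered; read on every request.
    private final Map<String, ResponseFactory> factories = new ConcurrentHashMap<>();

    // Assumed key: what the factory's annotations declare it can handle.
    private static String key(String method, String resourceType) {
        return method + ":" + resourceType;
    }

    /** Called when the factory service appears (e.g. on OSGi registration). */
    public void register(String method, String resourceType, ResponseFactory factory) {
        factories.put(key(method, resourceType), factory);
    }

    /** Called when the factory service goes away. */
    public void unregister(String method, String resourceType) {
        factories.remove(key(method, resourceType));
    }

    /** One map get, no iteration; returns null when nothing is registered. */
    public Response resolve(String method, String resourceType) {
        ResponseFactory factory = factories.get(key(method, resourceType));
        return factory == null ? null : factory.create();
    }
}
```

<p>Because each request gets its own response object from the factory, the object itself does not need to be thread safe; only the registry map does.</p>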
<p>Methods 1 and 2 have complete control over the protocol and are wide open to abuse; method 3 follows a processing pattern closely related to Sling.</p>
<h2>Integration testing</h2>
<p>Unit testing is obvious: we do it, and we try to get 100% coverage of every use case that matters. In fact, if you work on a time and materials basis for anyone, you should read your contract carefully to work out whether you have to fix mistakes at your own expense. If you do, you will probably start writing more tests to prove to your client that what you did works. It&#8217;s no surprise that, in other branches of engineering, acceptance testing is part of many contracts. I don&#8217;t think an airline would take delivery of a new plane without starting the engines, or a shipping line take delivery of a supertanker without checking that it floats. I am bemused that software engineers often get away with saying &#8220;it&#8217;s done&#8221; when clearly it&#8217;s not. Sure, we all make mistakes, but delivering code without test coverage is like handing over a ship that sinks.</p>
<p>Integration testing is less obvious. In Sling there is a set of integration tests that test just about everything against a running server. It&#8217;s part of the standard build but lives in its own project. It&#8217;s absolutely solid and ensures that nothing broken gets built, but as an average mortal I found it scary, since when things did break I had to work hard to find out why. That&#8217;s why in Nakamura we wrote all integration tests as scripts: initially bash and Perl, then later Ruby. With hindsight this was a huge mistake. First, you had to configure your machine to run Ruby and all the extensions needed. Not too hard on Linux, but for a time those on OS X would wait forever for ports to finish building some base library. Dependencies gone mad. Fine if you were one of the few who created the system and pulled everything in over many months, but hell for the newcomer. Mostly, the newcomer walks away, or tweets something that everyone ignores.</p>
<p>The devs also get off the hook. New ones don&#8217;t know where to write the tests, or have to learn Ruby (replace Ruby with whatever the scripting language is). Old devs can sweep the tests under the carpet and, when it gets to release time, ignore the fact that 10% of them are still broken&#8230; because they didn&#8217;t have time to maintain them 3 Fridays ago at 18:45, just before they went to a party. The party where they zapped 1% of their brain cells, including the ones that were remembering what they should have done at 18:49. Still, they had a good time, the evening raised their morale and started a great weekend ready for the next week, and besides, they had no intention of boarding the ship.</p>
<p>So the integration testing here is done as Java unit tests. If this were a C++ project they would be C++ unit tests. They live in the bundle where the code they test is. They are run by &#8220;mvn -Pintegration test&#8221;. Even the command says what is going to happen. It starts a full instance of the server (now 12s becomes an age), or uses one that&#8217;s already running, and runs the tests. If you&#8217;re in Eclipse, they can be run in Eclipse just as any other test might, and this being OSGi, the new code in the bundle can be redeployed to the running OSGi container. That way the dev creating the bundle can put their tests in their bundle and do integration testing with the same tools they used for unit testing. No excuse. &#8220;find . -type d -name integration | grep src/test&#8221; finds all integration tests and, by omission, ships that sink.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2011/11/25/minimalist/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>Sparse Map Content 1.3 Released</title>
		<link>http://blog.tfd.co.uk/2011/11/21/sparse-map-content-1-3-released/</link>
		<comments>http://blog.tfd.co.uk/2011/11/21/sparse-map-content-1-3-released/#comments</comments>
		<pubDate>Mon, 21 Nov 2011 01:37:31 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=537</guid>
		<description><![CDATA[Sparse Map version 1.3 has been tagged (org.sakaiproject.nakamura.core-1.3) and released. Downloads of the source tree in Zip and TarGZ form are available from GitHub. In this release 8 issues were addressed, the details are in the issue tracker. If you find any issues, please mention them to me or, better still, add an issue to the issue tracker. Unless otherwise stated the [...]]]></description>
			<content:encoded><![CDATA[<p>Sparse Map version 1.3 has been tagged (<a href="https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.core-1.3">org.sakaiproject.nakamura.core-1.3</a>) and released. Downloads of the source tree in <a href="https://github.com/ieb/sparsemapcontent/zipball/org.sakaiproject.nakamura.core-1.3">Zip</a> and <a href="https://github.com/ieb/sparsemapcontent/tarball/org.sakaiproject.nakamura.core-1.3">TarGZ</a> form are available from GitHub.</p>
<p>In this release 8 <a href="https://github.com/ieb/sparsemapcontent/issues?sort=created&amp;direction=desc&amp;state=closed&amp;page=1&amp;milestone=4">issues</a> were addressed; the details are in the <a href="https://github.com/ieb/sparsemapcontent/issues">issue tracker</a>. If you find any issues, please mention them to me or, better still, add an issue to the issue tracker. Unless otherwise stated, the license is Apache 2. Thanks to everyone who made this release possible.</p>
<pre>Tag: <a href="https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.core-1.3">https://github.com/ieb/sparsemapcontent/tree/org.sakaiproject.nakamura.core-1.3</a>
Issues Fixed: <a href="https://github.com/ieb/sparsemapcontent/issues?sort=created&amp;direction=desc&amp;state=closed&amp;page=1&amp;milestone=4">https://github.com/ieb/sparsemapcontent/issues?sort=created&amp;direction=desc&amp;state=closed&amp;page=1&amp;milestone=4</a></pre>
<p>To use it in a Maven project:</p>
<pre>&lt;dependency&gt;
  &lt;groupId&gt;org.sakaiproject.nakamura&lt;/groupId&gt;
  &lt;artifactId&gt;org.sakaiproject.nakamura.core&lt;/artifactId&gt;
  &lt;version&gt;1.3&lt;/version&gt;    
&lt;/dependency&gt;</pre>
<p>The Jar is an OSGi bundle complete with Manifest, bundled dependencies and services, ready for use in Apache Felix.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2011/11/21/sparse-map-content-1-3-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
		<item>
		<title>Clustering Sakai OAE: Part II of ?</title>
		<link>http://blog.tfd.co.uk/2011/11/18/clustering-sakai-oae-part-ii-of-n/</link>
		<comments>http://blog.tfd.co.uk/2011/11/18/clustering-sakai-oae-part-ii-of-n/#comments</comments>
		<pubDate>Fri, 18 Nov 2011 04:17:29 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=533</guid>
		<description><![CDATA[Part II : Solr Clustering This post carries on from my previous post about clustering Sakai OAE at Charles Sturt University here in Australia. The work done recently has focused on two issues. Firstly a cluster of app server nodes to run against a cluster of Solr nodes, and secondly ensuring that the UI doesn&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<h2>Part II : Solr Clustering</h2>
<p>This post carries on from my previous post about clustering Sakai OAE at Charles Sturt University here in Australia. The work done recently has focused on two issues: firstly, getting a cluster of app server nodes to run against a cluster of Solr nodes, and secondly, ensuring that the UI doesn&#8217;t flinch if the Solr nodes disappear. There is one caveat: the UI can flinch if there are no Solr servers alive, but the user should not notice if we bring Solr servers down, even if it&#8217;s the master Solr server.</p>
<h2>Solr in clusters and in Clouds</h2>
<p>Anyone wanting to know in detail how standard Solr clusters or clouds work should go and read the Solr Cloud documentation. Solr clusters on a number of levels, but for the scale of indexes that OAE is going to need to support initially, we are unlikely to need much more than basic clustering. By basic clustering I mean using a cluster of Solr servers that each have a copy of the index and are able to service queries. The next level of clustering up would be to start sharding the Solr index, so that we have sub-clusters of Solr servers operating on shards of the entire index. The configuration of both of these is described in the Solr Cloud documentation, using ZooKeeper to manage the cluster. At CSU we are using VMs managed by PuppetD, simply because that integrates with the current architecture. One thing I should mention at this stage, as the managers head for the hills with their hands in the air at all this complexity&#8230; these are standard Solr servers, out of the box, running in Tomcat with only mild configuration. Easy to do in 15 minutes. We are not talking about deploying OAE proprietary code here.</p>
<h2>Query operations</h2>
<p>To make a cluster of Solr servers take the load of many app server nodes, there needs to be load balancing and automatic failover at the app server nodes. CSU uses hardware load balancers from a well-known manufacturer that are perfectly capable of performing application-layer load balancing; however, not all deployment sites have that capability, so we are using the LBHttpSolrServer, configured with a large list of potential Solr servers. The ones that are not up and running are not used, which gives us the ability to add more at will to take increased app server load. We can, to a degree, scale elastically. We could make this dynamic through ZooKeeper or some other lookup mechanism. There are a number of available LB algorithms with this SolrJ client, and in tests we have done, adding and removing Solr servers from the cluster goes completely unnoticed at the UI.</p>
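<p>The idea of a long list of potential servers, where dead ones are simply skipped, can be sketched in plain Java. This is not how LBHttpSolrServer is implemented (the real client also tracks dead servers and re-checks them in the background); the class name FailoverBalancer and the use of a Function to stand in for a query are assumptions made for illustration only.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical round-robin balancer with failover; a sketch, not the SolrJ implementation.
class FailoverBalancer {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger();

    public FailoverBalancer(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        this.servers = new ArrayList<>(servers);
    }

    /**
     * Sends the request to the next server in rotation. If that server fails,
     * the remaining servers are tried in order; the caller only sees an error
     * when every server in the list has failed.
     */
    public <T> T query(Function<String, T> request) {
        int start = Math.floorMod(next.getAndIncrement(), servers.size());
        RuntimeException lastFailure = null;
        for (int i = 0; i < servers.size(); i++) {
            String server = servers.get((start + i) % servers.size());
            try {
                return request.apply(server);
            } catch (RuntimeException e) {
                lastFailure = e; // server down or unreachable: fall through to the next one
            }
        }
        throw new IllegalStateException("no Solr server available", lastFailure);
    }
}
```

<p>With a list such as solr1, solr2, solr3 and solr1 down, every query still succeeds as long as one of the others answers, which is the behaviour the UI relies on.</p>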
<p>That was the easy part; now the hard part. OAE uses Solr in a close to real-time mode. We are not indexing a large corpus of slowly changing or immutable objects. We are indexing user-generated content, and we think that our users expect their content to appear in search results immediately. Anyone who knows how non-trivial inverted indexes operate will know that, up to a point, that is possible on a single JVM, but replicating that behaviour over a cluster is unrealistic. A highly distributed transactional model, where on commit everything everywhere is in the same state (+- 1-2ms), is hard to deliver while also providing responsiveness and scalability. Certainly, near-zero latency between a user&#8217;s actions and that update appearing in all indexes is resource intensive and hard to achieve. So we have adopted a just-in-time model. When a user updates something, the data reaches the location where the user is going to look for it just before they look for it, i.e. just in time. The question then is: how much time do we have? I like to call it Quality of Update Service (QoUS). UI folks want no more than a few ms, so that the next Ajax request finds the data, and that puts pressure on the application. A simple application operating on a single JVM can use the Solr index to publish all data everywhere in the index instantly. Solr4 helps us here, since it can do soft commits, delivering near real-time indexing. However, to get any sort of scalability and reliability in the application, we can&#8217;t operate in a single memory space. The application needs to publish data that it knows will be needed in a few ms to the locations where it&#8217;s needed, achieving the QoUS required. Other locations can be updated more leisurely. That was one of the reasons for moving from a single priority queue in the Solr bundle 1.1 release to multiple priority queues in the Solr bundle 1.2 release.
That improvement doesn&#8217;t help with the fundamental problem (bear with me, I will get to the point soon), which is not present in a single-JVM setup. Writing to a large distributed inverted index so that all views of that index are consistent is not instant. Certainly not in Solr4. So an application that wants to scale and have no single point of query or failure must accept that when an update is made, not all dependencies of that data will be updated instantly. OAE is not there yet. It probably demands that OAE looks at what data needs to be where, and when. Data with a high QoUS (i.e. low latency) should be explicitly published within the request cycle. Data with a low QoUS should be indexed asynchronously. The request accessing the published data will need to convert from queries through the bitmap of the inverted index into get-by-pointer operations. This will enable the application to break out of the bottleneck that it has been presented with (finally, the point). A Solr index only allows a single writer. You can&#8217;t update the same document in the same Solr core from multiple JVMs or VMs. AFAIK, merging commits from multiple segments from different Solr servers on independent disconnected timelines into a single index core is not supported. There is a hard 1:1 relationship between the index writer and the core being updated. So if you lose the process managing the update operation, you lose the ability to update the core.</p>
<p>The Solr bundle used in OAE addresses this by queuing update events in a persistent, disk-based transactional queue. When the Solr master (the only Solr instance managing the Solr core in question) dies, events remain in the queue pending processing. When the master comes back, the queue restarts update operations and the data that was changed enters the index. This is not rocket science; it&#8217;s very simple, but it does require that the UI isn&#8217;t expecting anything in that queue to be available on the next Ajax request. AFAIK, OAE is not in that place yet.</p>
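<p>The replay behaviour described above can be sketched as follows. This is a simplified, in-memory stand-in and not the Solr bundle&#8217;s code: the real queue is persistent and transactional on disk, whereas the hypothetical ReplayQueue below only shows the control flow, namely that an event leaves the queue only after the indexer has accepted it, so events survive a master outage and are replayed once it returns.</p>

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for a persistent update queue; illustration only.
class ReplayQueue {
    private final Deque<String> pending = new ArrayDeque<>();

    /** Called from request threads; detached from the indexing operation itself. */
    public synchronized void enqueue(String documentId) {
        pending.addLast(documentId);
    }

    /**
     * Feeds queued events to the indexer in order. An event is removed only after
     * the indexer accepts it. On the first failure draining stops, leaving that
     * event and everything behind it queued for the next attempt.
     * Returns the number of events indexed in this pass.
     */
    public synchronized int drain(Consumer<String> indexer) {
        int indexed = 0;
        while (!pending.isEmpty()) {
            String documentId = pending.peekFirst();
            try {
                indexer.accept(documentId);
            } catch (RuntimeException e) {
                break; // master unavailable: keep the event and retry later
            }
            pending.removeFirst();
            indexed++;
        }
        return indexed;
    }

    public synchronized int size() {
        return pending.size();
    }
}
```

<p>Retrying an event that was in flight when the master died is harmless here only because, as noted later in the post, document IDs are bound to the content they represent, so indexing the same document twice does no harm.</p>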
<h2>CSU&#8217;s OAE/Solr Cluster</h2>
<p>So back to the clustering. We have a cluster up and running at CSU. You will remember from Part I that bouncing app server nodes works. A UI user doesn&#8217;t know or notice that their state is wandering about the cluster as app server nodes go up and down. All the queries that an app server makes to the Solr servers are load balanced, and the LB algorithms inside the LBHttpSolrServer that comes with Solr instantly recognise when a Solr slave dies or becomes available. So the users have no idea what might be happening in the underlying infrastructure. When the master Solr instance, off which all the slaves are feeding, dies, indexing operations pause, queuing up events on each app server node. When the Solr master comes back, indexing restarts. The write operation to the queue is concurrent and thread safe, allowing both synchronous and asynchronous notification events to be persisted in the queue. That write operation is also detached from the indexing operation, so when indexing stops, the application server continues as if nothing had happened. Provided the UI does not place great dependence on data flowing through this route, no user will notice that the Solr master has been taken offline. As any ops person will tell you, it was taken offline&#8230; it didn&#8217;t just die. Just remember to ask: what took it offline?</p>
<h2>Other issues encountered</h2>
<p>Prior to release 1.2 and the work at CSU we were using the StreamingUpdateSolrServer. This is an efficient implementation of the SolrJ client that has an in-memory queue and multiple queue processor threads. That queue and pool of workers allow the client to overlap network latency by parallelizing multiple concurrent update operations, which are managed by the worker threads. Only when the client commits does the main client thread block while all previous update threads complete and the queue is emptied. There are several issues here. In the current Solr4 implementation, errors on operations performed by the pool of threads are not communicated back to the main client thread. Hence it has no way of knowing when a remote update server has failed. Only when a commit is performed does the main client thread know there was a problem. This is not too much of a problem when the server goes down, since the commit will be the last operation and hence will roll back the entire update transaction in the client. It also doesn&#8217;t matter that the SolrJ client&#8217;s in-memory queue is lost, since the on-disk queue never gets committed and the document IDs are immutably bound to the content they represent, hence indexing twice does no harm. The problem comes when the queue restarts. The client only knows that the commit completed. It doesn&#8217;t know how many of the index update operations performed in that transaction were successful, hence with the StreamingUpdateSolrServer we find that on restart partial update transactions get through to the index. Switching to the CommonsHttpSolrServer, which uses the client thread for all operations, addresses this issue.</p>
<p>One of the other issues that appeared was that with more than one queue, sharing the same update SolrJ client, commits on one thread would commit operations on the other thread. The classes are thread safe but the transactions inside the SolrJ client are not bound to a transaction context and so the SolrJ clients can&#8217;t be shared between transactions. We now bind SolrJ clients to transaction contexts.</p>
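<p>A toy model of that fix, under stated assumptions: BufferingClient below is a hypothetical stand-in for an update client whose commit flushes everything added through that instance, and TransactionBoundClients hands out one such client per transaction context. None of these names come from SolrJ or the Solr bundle. It shows why a client per transaction keeps one queue&#8217;s commit from publishing another queue&#8217;s uncommitted work.</p>

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: one update client per transaction context.
class TransactionBoundClients {

    /** Stand-in for an update client: commit publishes everything added via this instance. */
    public static class BufferingClient {
        private final List<String> uncommitted = new ArrayList<>();
        private final List<String> index;

        BufferingClient(List<String> index) {
            this.index = index;
        }

        public void add(String documentId) {
            uncommitted.add(documentId);
        }

        /** Publishes only this client's pending documents. */
        public void commit() {
            index.addAll(uncommitted);
            uncommitted.clear();
        }

        /** Discards only this client's pending documents. */
        public void rollback() {
            uncommitted.clear();
        }
    }

    // Shared committed state; synchronized because clients commit from different threads.
    private final List<String> index = Collections.synchronizedList(new ArrayList<String>());

    /** Each transaction context gets its own client, so commits cannot leak across transactions. */
    public BufferingClient begin() {
        return new BufferingClient(index);
    }

    public List<String> committedDocuments() {
        return new ArrayList<>(index);
    }
}
```

<p>Had both queues shared a single BufferingClient, the first commit would have published the other queue&#8217;s pending documents as well, which is the behaviour described above.</p>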
<p>The work done at CSU will make it into the Solr bundle in the 1.3 release, and any changes that were made to the core code will undoubtedly make it into the managed project code base. A selection of lead developers from the managed project have access to the CSU private repository.</p>
<p><a href="http://ianboston.files.wordpress.com/2011/11/csucluster.png"><img class="alignnone size-full wp-image-534" title="CSUCluster" src="http://ianboston.files.wordpress.com/2011/11/csucluster.png?w=510&#038;h=619" alt="" width="510" height="619" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2011/11/18/clustering-sakai-oae-part-ii-of-n/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>

		<media:content url="http://ianboston.files.wordpress.com/2011/11/csucluster.png" medium="image">
			<media:title type="html">CSUCluster</media:title>
		</media:content>
	</item>
		<item>
		<title>Solr Search Bundle 1.2 released</title>
		<link>http://blog.tfd.co.uk/2011/11/18/solr-search-bundle-1-2-released/</link>
		<comments>http://blog.tfd.co.uk/2011/11/18/solr-search-bundle-1-2-released/#comments</comments>
		<pubDate>Fri, 18 Nov 2011 02:21:42 +0000</pubDate>
		<dc:creator>Ian</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=530</guid>
		<description><![CDATA[The Solr Search v1.2 bundle developed for Nakamura has been released. This is not to be confused with Apache Solr 4. The bundle wraps a snapshot version of Apache Solr 4 at revision 1162474 and exposes a number of OSGi components that allow a SolrJ client to interact with the Solr server. This release fixes a [...]]]></description>
			<content:encoded><![CDATA[<p>The Solr Search v1.2 bundle developed for Nakamura has been released. This is not to be confused with Apache Solr 4. The bundle wraps a snapshot version of Apache Solr 4 at revision 1162474 and exposes a number of OSGi components that allow a SolrJ client to interact with the Solr server.</p>
<p>This release fixes a number of bugs related to concurrency and the indexing operation, identified after moving from a single-threaded indexing operation to a multi-threaded one. These bugs were introduced in the previous release by sharing a StreamingUpdateSolrServer between multiple threads. Although the class is thread safe and, after about June 2011, does not hang, it does contain an internal memory-based queue that asynchronously sends updates to a remote server. I should state, and you guessed it, that this only impacts situations where the Solr server being updated is not in the same JVM. The problem is that should any of the updates fail, no communication of that fact propagates back to the thread that performed the update operation. In the case of the Solr bundle we attempt to make the indexing queue reliable with a transactional, persistent queue. However, since we don&#8217;t know whether the update operation failed, we have no chance of working out what to do with the batch of updates being processed. This release fixes those issues. It also fixes a number of clustering and failover issues discovered at Charles Sturt University, which I will leave for a follow-up post.</p>
<p>Other improvements are listed with the issues fixed against this version, link below. As always, thanks go to everyone who contributed and helped to get this release out.</p>
<pre>Issues Fixed: https://github.com/ieb/solr/issues?sort=created&amp;direction=desc&amp;state=closed&amp;page=1&amp;milestone=3
Release Tag: https://github.com/ieb/solr/tree/org.sakaiproject.nakamura.solr-1.2</pre>
<p>Downloads are available from the release tag.</p>
<p>To use from a Maven 2 project:</p>
<pre> 
    &lt;dependency&gt;
        &lt;groupId&gt;org.sakaiproject.nakamura&lt;/groupId&gt;
        &lt;artifactId&gt;org.sakaiproject.nakamura.solr&lt;/artifactId&gt;
        &lt;version&gt;1.2&lt;/version&gt;
    &lt;/dependency&gt;</pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.tfd.co.uk/2011/11/18/solr-search-bundle-1-2-released/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/b4c84c66ffbbb824b5ecf24362318554?s=96&#38;d=&#38;r=G" medium="image">
			<media:title type="html">ian</media:title>
		</media:content>
	</item>
	</channel>
</rss>
