<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments for Timefields</title>
	<atom:link href="http://blog.tfd.co.uk/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.tfd.co.uk</link>
	<description>Open Source Open Thought</description>
	<lastBuildDate>Fri, 10 Feb 2012 00:57:04 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>Comment on Access Control Lists in Solr/Lucene by Ian</title>
		<link>http://blog.tfd.co.uk/2012/02/08/access-control-lists-in-solrlucene/#comment-1337</link>
		<dc:creator><![CDATA[Ian]]></dc:creator>
		<pubDate>Fri, 10 Feb 2012 00:57:04 +0000</pubDate>
		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=567#comment-1337</guid>
		<description><![CDATA[Hey Chuck, thanks for the feedback. I like the last line; it made me laugh, only because I know it&#039;s so untrue.

Sharding based on terms will help, in the sense that it minimises the cardinality of the inverted index; however, there is a good presentation at ElasticSearch on why term sharding has problems. It&#039;s hard to balance, which kills gather-scatter (or map reduce) performance, and the network bandwidth requirements are higher.

Even so, the 3 problems here are the number of query terms (one for each principal the user has), the time it takes to process a single term in a query, and the number of principals per document. If each document had one principal, the query operation would be a first order problem relative to the number of principals a user has. Since docs can have many principals, it&#039;s a second order problem in the index, which makes it too expensive for real time searching in real use cases (think 1s response, not 2ms). We could hash all the principals together, but that would raise the cardinality up to order n*(n-1)*...*(n-t+1), where n is the total number of principals and t is the number held per user/document. n could easily be 1M and t 500 in a medium size system. In practice the cardinality would be unlikely to ever reach the predicted value, but it&#039;s still going to be big. In Sakai 2 there was a perfect organization of content that made the responses dense. Content was organized into groups and, in general, users who were members of those groups had access to all the content in those groups, so you could simply use a single read principal on each Lucene doc to generate a suitably dense result set. The case where a single user had more principals than Lucene could cope with hasn&#039;t happened (yet), and when it does, there is a simple solution: make a special principal for that set of groups, and add it to all documents that appear in any one of those groups. Unfortunately, in Sakai OAE there is no organising principle, and hence all documents already have 10s of principals and users already have hundreds. I think most systems are more like Sakai 2 in this respect.]]></description>
		<content:encoded><![CDATA[<p>Hey Chuck, thanks for the feedback. I like the last line; it made me laugh, only because I know it&#8217;s so untrue.</p>
<p>Sharding based on terms will help, in the sense that it minimises the cardinality of the inverted index; however, there is a good presentation at ElasticSearch on why term sharding has problems. It&#8217;s hard to balance, which kills gather-scatter (or map reduce) performance, and the network bandwidth requirements are higher.</p>
<p>Even so, the 3 problems here are the number of query terms (one for each principal the user has), the time it takes to process a single term in a query, and the number of principals per document. If each document had one principal, the query operation would be a first order problem relative to the number of principals a user has. Since docs can have many principals, it&#8217;s a second order problem in the index, which makes it too expensive for real time searching in real use cases (think 1s response, not 2ms). We could hash all the principals together, but that would raise the cardinality up to order n*(n-1)*&#8230;*(n-t+1), where n is the total number of principals and t is the number held per user/document. n could easily be 1M and t 500 in a medium size system. In practice the cardinality would be unlikely to ever reach the predicted value, but it&#8217;s still going to be big. In Sakai 2 there was a perfect organization of content that made the responses dense. Content was organized into groups and, in general, users who were members of those groups had access to all the content in those groups, so you could simply use a single read principal on each Lucene doc to generate a suitably dense result set. The case where a single user had more principals than Lucene could cope with hasn&#8217;t happened (yet), and when it does, there is a simple solution: make a special principal for that set of groups, and add it to all documents that appear in any one of those groups. Unfortunately, in Sakai OAE there is no organising principle, and hence all documents already have 10s of principals and users already have hundreds. I think most systems are more like Sakai 2 in this respect.</p>
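<p>A rough sketch of the cardinality estimate for hashing principals together, assuming Python 3.8+ and treating the bound as the falling factorial n*(n-1)*...*(n-t+1). The order-independent count (n choose t) is shown alongside for comparison, on the assumption that principals could be sorted before hashing. The function name is hypothetical.</p>

```python
import math


def hashed_principal_cardinality(n, t):
    """Upper bounds on distinct values when t of n principals are hashed together.

    Returns (ordered, unordered):
      ordered   = n*(n-1)*...*(n-t+1), if the hash depends on principal order
      unordered = n choose t, if principals are sorted before hashing
    """
    return math.perm(n, t), math.comb(n, t)


# Small case that is easy to check by hand: 10*9*8 = 720 and 720/3! = 120.
print(hashed_principal_cardinality(10, 3))  # (720, 120)

# The sizes mentioned above (n = 1M, t = 500): report only the digit counts,
# since the numbers themselves are far too large to be useful printed out.
ordered, unordered = hashed_principal_cardinality(1_000_000, 500)
print(len(str(ordered)), len(str(unordered)))
```

<p>Either way the bound is astronomically larger than any real index, which fits the point that the practical cardinality would stay well below the prediction while still being big.</p>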
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Access Control Lists in Solr/Lucene by drchuck (@drchuck)</title>
		<link>http://blog.tfd.co.uk/2012/02/08/access-control-lists-in-solrlucene/#comment-1335</link>
		<dc:creator><![CDATA[drchuck (@drchuck)]]></dc:creator>
		<pubDate>Fri, 10 Feb 2012 00:12:17 +0000</pubDate>
		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=567#comment-1335</guid>
		<description><![CDATA[Ian, thanks for an excellent article. I do wonder why you don&#039;t simply alter the data model in SOLR to have a simple model of principals and do the data reduction inside of SOLR. At the kinds of data sizes and performance constraints you are talking about, it is silly to maintain the two pieces of the problem on two sides of an abstraction for purity&#039;s sake. Frankly, SOLR needs AUTHZ-aware search for *lots* of reasons, not just learning management systems. If your answer is that the inverted index is not relational and uses some magic structures, I accept that. If the inverted indexes are not relational, then your analysis suggests that this is SOLR&#039;s way of telling you that you need to shard, and it is telling you what to shard on :) If a SOLR instance can handle, say, 500 principals per document in one instance, then make two instances of the inverted index, and once a document reaches a certain number of principals, we just add the document to the second shard and indicate that it belongs to the next 500 principals on that shard, and so forth. Use scatter/gather style across the shards, and if you need 100 results, ask for the best 100 from each shard and do an insertion sort, and it is all quite tractable. Now the question is what you do when you find that one file with 100,000 principals. Hmmm. Maybe sharding without adding processors. Or perhaps just add the same document to the index twice for the second 500 principals. If the document with the most principals has 2500 principals, then to get the top 100 results you need to grab the top 500 results and eliminate duplicates. I wish you could draw this on a whiteboard so I could understand it better. Ah well. It is cool to give architecture/scaling/performance advice when I literally *have no idea what I am talking about* :).]]></description>
		<content:encoded><![CDATA[<p>Ian, thanks for an excellent article. I do wonder why you don&#8217;t simply alter the data model in SOLR to have a simple model of principals and do the data reduction inside of SOLR. At the kinds of data sizes and performance constraints you are talking about, it is silly to maintain the two pieces of the problem on two sides of an abstraction for purity&#8217;s sake. Frankly, SOLR needs AUTHZ-aware search for *lots* of reasons, not just learning management systems. If your answer is that the inverted index is not relational and uses some magic structures, I accept that. If the inverted indexes are not relational, then your analysis suggests that this is SOLR&#8217;s way of telling you that you need to shard, and it is telling you what to shard on <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> If a SOLR instance can handle, say, 500 principals per document in one instance, then make two instances of the inverted index, and once a document reaches a certain number of principals, we just add the document to the second shard and indicate that it belongs to the next 500 principals on that shard, and so forth. Use scatter/gather style across the shards, and if you need 100 results, ask for the best 100 from each shard and do an insertion sort, and it is all quite tractable. Now the question is what you do when you find that one file with 100,000 principals. Hmmm. Maybe sharding without adding processors. Or perhaps just add the same document to the index twice for the second 500 principals. If the document with the most principals has 2500 principals, then to get the top 100 results you need to grab the top 500 results and eliminate duplicates. I wish you could draw this on a whiteboard so I could understand it better. Ah well.
It is cool to give architecture/scaling/performance advice when I literally *have no idea what I am talking about* <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
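<p>The scatter/gather idea above, with duplicate elimination for documents that were indexed more than once, might look roughly like this. It is a plain Python sketch and not Solr code: the (score, doc_id) hit shape and the function name are assumptions, and it uses a heap-based top-k selection in place of the insertion sort mentioned.</p>

```python
import heapq
from typing import Dict, Iterable, List, Tuple

# Hypothetical hit shape: (relevance score, document id).
Hit = Tuple[float, str]


def merge_top_k(shard_hits: Iterable[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard result lists into one global top-k list.

    A document may come back from several shards (or several times from one
    shard) when it was indexed once per block of principals, so only the
    best score seen for each doc_id is kept before ranking.
    """
    best: Dict[str, float] = {}
    for hits in shard_hits:
        for score, doc_id in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Rank the de-duplicated hits by score, highest first.
    return heapq.nlargest(k, ((score, doc_id) for doc_id, score in best.items()))


shard_a = [(0.9, "doc1"), (0.5, "doc2")]
shard_b = [(0.8, "doc1"), (0.7, "doc3")]
print(merge_top_k([shard_a, shard_b], 2))  # [(0.9, 'doc1'), (0.7, 'doc3')]
```

<p>Because duplicates are dropped after the gather step, each shard has to be asked for more than k hits when a document can appear several times, which is the over-fetching described above.</p>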
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Deprecate Solr Bundle by Ian</title>
		<link>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/#comment-1320</link>
		<dc:creator><![CDATA[Ian]]></dc:creator>
		<pubDate>Fri, 03 Feb 2012 22:43:05 +0000</pubDate>
		<guid isPermaLink="false">http://ianboston.wordpress.com/?p=556#comment-1320</guid>
		<description><![CDATA[Cool, thanks for the pointer.]]></description>
		<content:encoded><![CDATA[<p>Cool, thanks for the pointer.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Deprecate Solr Bundle by sematext</title>
		<link>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/#comment-1319</link>
		<dc:creator><![CDATA[sematext]]></dc:creator>
		<pubDate>Fri, 03 Feb 2012 22:31:33 +0000</pubDate>
		<guid isPermaLink="false">http://ianboston.wordpress.com/?p=556#comment-1319</guid>
		<description><![CDATA[Only skimmed it just, but possibly ./src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java]]></description>
		<content:encoded><![CDATA[<p>Only skimmed it just, but possibly ./src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Deprecate Solr Bundle by Ian</title>
		<link>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/#comment-1318</link>
		<dc:creator><![CDATA[Ian]]></dc:creator>
		<pubDate>Fri, 03 Feb 2012 02:30:44 +0000</pubDate>
		<guid isPermaLink="false">http://ianboston.wordpress.com/?p=556#comment-1318</guid>
		<description><![CDATA[Sorry to pester you ;),
Do you have a pointer to how that&#039;s configured? I have googled and grepped the code base at r1162474, but there isn&#039;t anything that stands out.
Sticking with Solr would save loads of hassle, and it sounds like real time replication might be enough for us.]]></description>
		<content:encoded><![CDATA[<p>Sorry to pester you <img src='http://s1.wp.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> ,<br />
Do you have a pointer to how that&#8217;s configured? I have googled and grepped the code base at r1162474, but there isn&#8217;t anything that stands out.<br />
Sticking with Solr would save loads of hassle, and it sounds like real time replication might be enough for us.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Deprecate Solr Bundle by sematext</title>
		<link>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/#comment-1317</link>
		<dc:creator><![CDATA[sematext]]></dc:creator>
		<pubDate>Fri, 03 Feb 2012 01:57:56 +0000</pubDate>
		<guid isPermaLink="false">http://ianboston.wordpress.com/?p=556#comment-1317</guid>
		<description><![CDATA[We have customers using Solr trunk now (though not SolrCloud features).
Replication is real-time, which implies it&#039;s not based on segment replication.]]></description>
		<content:encoded><![CDATA[<p>We have customers using Solr trunk now (though not SolrCloud features).<br />
Replication is real-time, which implies it&#8217;s not based on segment replication.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Deprecate Solr Bundle by Ian</title>
		<link>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/#comment-1316</link>
		<dc:creator><![CDATA[Ian]]></dc:creator>
		<pubDate>Fri, 03 Feb 2012 01:16:40 +0000</pubDate>
		<guid isPermaLink="false">http://ianboston.wordpress.com/?p=556#comment-1316</guid>
		<description><![CDATA[Thanks for taking the time to read the post, and thanks for the pointer. I was looking at the old SolrCloud page &lt;a href=&quot;http://wiki.apache.org/solr/SolrCloud&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/solr/SolrCloud&lt;/a&gt; where push replication was listed as a low priority and the transaction log was missing. Good to see them on &lt;a href=&quot;http://wiki.apache.org/solr/NewSolrCloudDesign&quot; rel=&quot;nofollow&quot;&gt;http://wiki.apache.org/solr/NewSolrCloudDesign&lt;/a&gt; front and center.

Your blog post seems to indicate that things like the transaction log are not complete and &lt;a href=&quot;https://issues.apache.org/jira/browse/SOLR-2700&quot; rel=&quot;nofollow&quot;&gt;https://issues.apache.org/jira/browse/SOLR-2700&lt;/a&gt; is being worked on. Nice to see, as ever in Solr, that there is lots of review and verification before anyone is prepared to say it&#039;s done. I feel quite safe taking 4.0-SNAPSHOTS because of that.

Do you have a feeling for when it&#039;s going to be safe to use the NewSolrCloud in production for mere mortals?

Also, is the replication done by pushing segments, or is it incremental per document?]]></description>
		<content:encoded><![CDATA[<p>Thanks for taking the time to read the post, and thanks for the pointer. I was looking at the old SolrCloud page <a href="http://wiki.apache.org/solr/SolrCloud" rel="nofollow">http://wiki.apache.org/solr/SolrCloud</a> where push replication was listed as a low priority and the transaction log was missing. Good to see them on <a href="http://wiki.apache.org/solr/NewSolrCloudDesign" rel="nofollow">http://wiki.apache.org/solr/NewSolrCloudDesign</a> front and center.</p>
<p>Your blog post seems to indicate that things like the transaction log are not complete and <a href="https://issues.apache.org/jira/browse/SOLR-2700" rel="nofollow">https://issues.apache.org/jira/browse/SOLR-2700</a> is being worked on. Nice to see, as ever in Solr, that there is lots of review and verification before anyone is prepared to say it&#8217;s done. I feel quite safe taking 4.0-SNAPSHOTS because of that.</p>
<p>Do you have a feeling for when it&#8217;s going to be safe to use the NewSolrCloud in production for mere mortals?</p>
<p>Also, is the replication done by pushing segments, or is it incremental per document?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Deprecate Solr Bundle by sematext</title>
		<link>http://blog.tfd.co.uk/2012/02/02/deprecate-solr-bundle/#comment-1315</link>
		<dc:creator><![CDATA[sematext]]></dc:creator>
		<pubDate>Fri, 03 Feb 2012 00:22:28 +0000</pubDate>
		<guid isPermaLink="false">http://ianboston.wordpress.com/?p=556#comment-1315</guid>
		<description><![CDATA[Note that SolrCloud has the transaction log and NRT as well now.  See http://blog.sematext.com/tag/solrcloud/]]></description>
		<content:encoded><![CDATA[<p>Note that SolrCloud has the transaction log and NRT as well now.  See <a href="http://blog.sematext.com/tag/solrcloud/" rel="nofollow">http://blog.sematext.com/tag/solrcloud/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on OSGi and SPI by Ian</title>
		<link>http://blog.tfd.co.uk/2011/12/13/osgi-and-spi/#comment-1263</link>
		<dc:creator><![CDATA[Ian]]></dc:creator>
		<pubDate>Mon, 19 Dec 2011 23:21:50 +0000</pubDate>
		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=544#comment-1263</guid>
		<description><![CDATA[I had not looked at it, thank you for the link. I agree that Fragments feel like an OSGi anti-pattern, but the fundamental problem I was struggling with was how to write an SPI implementation that had access to the SPI where the SPI was not exported outside the bundle. The reason the SPI was not exported is that I did not want well-behaved clients to bind to the SPI, bypassing the logic in the SPI consumer. I can&#039;t do a great deal about badly behaved clients that bypass normal OSGi boundaries. IIRC, RFC 167 assumes that the consumer and provider are in separate bundles, allowing any other bundle access to the SPI?

The other problem is that (IIUC) the ServiceLoader pattern was intended for simple services that either don&#039;t need to connect to the environment they are working within or explicitly have initialization as part of the SPI. Where the SPI is agnostic about initialization and context, leaning towards pure IoC-like patterns, it falls to the SPI implementation to acquire context through out-of-band or other means. What would be really nice is to see an RFC that enabled an SPI implementation to be a declaratively managed component, living in a separate bundle, where it gained access to a protected set of SPI and support classes within the SPI consumer bundle. I think I am right in saying that would require the declarative service manager to pre-bind to the SPI consumer classloader before instancing the SPI implementation, so that the SPI classes from the correct classloader were already available?

It is good to see Aries addressing these issues, and looking at RFC 167 it certainly looks like it&#039;s going to address the ServiceLoader pattern, which continues to be the bane of anyone integrating 3rd party components in OSGi. It&#039;s almost become a standard pattern in my code bases to set the context classloader to the bundle classloader in advance of performing initialisation of anything not designed for OSGi.]]></description>
		<content:encoded><![CDATA[<p>I had not looked at it, thank you for the link. I agree that Fragments feel like an OSGi anti-pattern, but the fundamental problem I was struggling with was how to write an SPI implementation that had access to the SPI where the SPI was not exported outside the bundle. The reason the SPI was not exported is that I did not want well-behaved clients to bind to the SPI, bypassing the logic in the SPI consumer. I can&#8217;t do a great deal about badly behaved clients that bypass normal OSGi boundaries. IIRC, RFC 167 assumes that the consumer and provider are in separate bundles, allowing any other bundle access to the SPI?</p>
<p>The other problem is that (IIUC) the ServiceLoader pattern was intended for simple services that either don&#8217;t need to connect to the environment they are working within or explicitly have initialization as part of the SPI. Where the SPI is agnostic about initialization and context, leaning towards pure IoC-like patterns, it falls to the SPI implementation to acquire context through out-of-band or other means. What would be really nice is to see an RFC that enabled an SPI implementation to be a declaratively managed component, living in a separate bundle, where it gained access to a protected set of SPI and support classes within the SPI consumer bundle. I think I am right in saying that would require the declarative service manager to pre-bind to the SPI consumer classloader before instancing the SPI implementation, so that the SPI classes from the correct classloader were already available?</p>
<p>It is good to see Aries addressing these issues, and looking at RFC 167 it certainly looks like it&#8217;s going to address the ServiceLoader pattern, which continues to be the bane of anyone integrating 3rd party components in OSGi. It&#8217;s almost become a standard pattern in my code bases to set the context classloader to the bundle classloader in advance of performing initialisation of anything not designed for OSGi.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on OSGi and SPI by Jeremias Märki</title>
		<link>http://blog.tfd.co.uk/2011/12/13/osgi-and-spi/#comment-1261</link>
		<dc:creator><![CDATA[Jeremias Märki]]></dc:creator>
		<pubDate>Mon, 19 Dec 2011 13:52:05 +0000</pubDate>
		<guid isPermaLink="false">http://blog.tfd.co.uk/?p=544#comment-1261</guid>
		<description><![CDATA[Have you looked at RFC 167 in the OSGi Enterprise 5 EA draft? This is being prototyped by Apache Aries: http://aries.apache.org/modules/spi-fly.html
Using fragment bundles doesn&#039;t really follow the OSGi spirit, IMO. I&#039;d want to expose plug-ins as OSGi services when working in an OSGi environment.]]></description>
		<content:encoded><![CDATA[<p>Have you looked at RFC 167 in the OSGi Enterprise 5 EA draft? This is being prototyped by Apache Aries: <a href="http://aries.apache.org/modules/spi-fly.html" rel="nofollow">http://aries.apache.org/modules/spi-fly.html</a><br />
Using fragment bundles doesn&#8217;t really follow the OSGi spirit, IMO. I&#8217;d want to expose plug-ins as OSGi services when working in an OSGi environment.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

