Semantic Search

4 06 2006

Currently Search performs its indexing on text streams. There is a significant amount of information that can be extracted from entities besides the simple digest of content. This includes things like the entity reference, the URL, title, description etc., and there is other metadata as well. We could create multiple indexes for this in Lucene quite easily, but that would not necessarily provide the search structure that is required. A better approach is probably going to be to represent this in RDF. So I'm going to try to enhance the EntityContentProducer with an RDF stream and place a pluggable RDF triple store underneath the search engine to operate as a secondary stream. It's quite possible that this will solve some of the search clustering problems, and it will certainly address the results clustering that would begin to make search really cool.
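
To make the idea a bit more concrete, here is a rough sketch of what an RDF stream next to the text stream might look like. It is written in Python for brevity and is not the actual Search code: the names Triple, InMemoryTripleStore and entity_triples are made up for illustration, the example reference and URL are invented, and the choice of Dublin Core predicates is an assumption. A real triple store would sit behind the same kind of add and match interface.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional

# Dublin Core element namespace; using it for entity metadata is an assumption.
DC = "http://purl.org/dc/elements/1.1/"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class InMemoryTripleStore:
    """Hypothetical stand-in for a pluggable triple store."""

    def __init__(self) -> None:
        self._triples = set()

    def add_all(self, triples: Iterable[Triple]) -> None:
        self._triples.update(triples)

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> Iterator[Triple]:
        # None acts as a wildcard for that position of the pattern.
        for t in self._triples:
            if subject is not None and t.subject != subject:
                continue
            if predicate is not None and t.predicate != predicate:
                continue
            if obj is not None and t.obj != obj:
                continue
            yield t


def entity_triples(reference: str, url: str, title: str, description: str) -> List[Triple]:
    """Hypothetical RDF stream for one entity, keyed by its entity reference."""
    return [
        Triple(reference, DC + "identifier", url),
        Triple(reference, DC + "title", title),
        Triple(reference, DC + "description", description),
    ]


store = InMemoryTripleStore()
store.add_all(
    entity_triples(
        "/content/group/site1/notes.txt",
        "http://example.org/access/content/group/site1/notes.txt",
        "Meeting notes",
        "Notes from the June meeting",
    )
)

# All titles known to the store, regardless of entity.
titles = [t.obj for t in store.match(predicate=DC + "title")]
# Everything the store knows about one entity.
about_notes = list(store.match(subject="/content/group/site1/notes.txt"))
```

The text index would still answer the free-text part of a query; the triple store would answer structural questions such as "everything with this title" or "all metadata for this entity reference", which is where clustering of results could start.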



2 responses

6 06 2006
Grant Ingersoll

There are several people working on the capability of adding payloads at specific positions (terms) in the Lucene index, which, I think, would allow you to achieve what you are trying to do in Lucene (in the near future).

For more info, see the java-dev mailing list for Lucene.

6 06 2006

Thanks, I will have a look,

but will that allow triple store search with RQL or similar? I am after the clustering, but I am also after discovery based on two things: A) the term vectors and B) the RDF. This will allow an ontology known at index time and an ontology known post index to present some discovery arcs. This approach is similar to that found in Simile Longwell2 and Piggybank, although I think that the analysis mechanism used there is a grammatical stemmer rather than an ontology-based thesaurus.
