Vivo Harvester for Symplectic Elements

25 04 2012

I’ve been doing some work recently on a Vivo Harvester for Symplectic Elements. Vivo, as its website says, aims to build a global collaboration network for researchers. They say National, I think Global. The project hopes to do this by collecting everything known about a researcher, including the links between researchers, and publishing that to the web. You could say it’s like Facebook or G+ profiles for academics, except that the data is derived from reliable sources. Humans are notoriously unreliable when talking about themselves, and academics may be the worst offenders. That means that much of the information within a Vivo profile, and the Academic Graph that the underlying semantic web represents, has been reviewed: either because the links between individuals have been validated by co-authored research in peer reviewed journals, or because the source of the information has been checked and validated.

If I compare the Academic Graph(tm) (just claimed tm over that if no one else has) with Social Graphs like OpenSocial, then I suspect that OpenSocial covers less than 10% of the ontology of an Academic Graph, and certainly a quick look at the base Vivo ontology reveals a few 100 top level object classes, each with 10s of data or object properties. That creates some challenges for Vivo that have influenced its design. Unlike Twitter’s FlockDB with 1 property to each relationship, “follow”, this is a fully linked data set. Again, unlike Twitter, there is no intention that a single Vivo instance, even cloud deployed, could host all researchers on a global scale. It seems that one of the standard RDF stores (Jena+SDB) is capable of at least holding the data and driving the user interface. I say holding, as since the early versions of Vivo it has used a Solr/Lucene index to provide query performance, since pure Jena queries would never keep pace with all the queries required by a live application. This introduces a secondary characteristic of Vivo: the update cycle. Updates to Vivo, especially via a harvester, require batched index re-builds, and may require re-computation of the RDF inferences. That places Vivo firmly in the infrequent update space, which is precisely what Solr was designed for. A production instance of Vivo uses an RDF store for URI resource reference into a rich Academic Graph, of which some is exposed through the UI. That RDF store populates views in Solr from which much of the UI is derived. Solr becomes a rich data index.

A Vivo Harvester, of which the Symplectic Elements harvester is just one of many, harvests data from a trusted source and generates triples representing the data and its relationships. Some Harvesters, like the PubMed harvester, perform that in their own ontology, whereas other Harvesters use the Vivo Ontology. The process of harvesting is to crawl the APIs of the target source, recovering all information by following links. In the case of Symplectic Elements its API is an ATOM API, so parts of the Symplectic Harvester could be adapted to any ATOM based feed. The harvested information is converted into statements about each resource to represent knowledge. X is a person, Y is a peer reviewed publication, Z is a conference and A is a grant. Finally the huge bucket of statements is processed to identify statements about the same thing and compared to the existing Vivo model of knowledge. Eventually, when all the links have been matched and duplicates disambiguated, that data can be ingested into the Vivo model.
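The final compare step can be pictured as a set difference over statements. What follows is only a minimal Java sketch under that assumption, not the Harvester's actual code: the Triple and HarvestDiff names are hypothetical, and "existing" is assumed to mean the statements previously ingested from the same source rather than the whole Vivo model.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

/** One harvested statement: subject, predicate, object (hypothetical type for illustration). */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    // Value equality, so the same statement harvested twice collapses to one entry in a Set.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Triple)) {
            return false;
        }
        Triple t = (Triple) other;
        return subject.equals(t.subject) && predicate.equals(t.predicate) && object.equals(t.object);
    }

    @Override
    public int hashCode() {
        return Objects.hash(subject, predicate, object);
    }

    @Override
    public String toString() {
        return subject + " " + predicate + " " + object;
    }
}

/** Hypothetical helper comparing a fresh harvest with what was ingested last time. */
final class HarvestDiff {
    /** Statements in the harvest that the existing set does not yet hold. */
    static Set<Triple> additions(Set<Triple> harvested, Set<Triple> existing) {
        Set<Triple> result = new LinkedHashSet<>(harvested);
        result.removeAll(existing);
        return result;
    }

    /** Statements in the existing set that the latest harvest no longer contains. */
    static Set<Triple> subtractions(Set<Triple> harvested, Set<Triple> existing) {
        Set<Triple> result = new LinkedHashSet<>(existing);
        result.removeAll(harvested);
        return result;
    }
}
```

Only the additions and subtractions would then need to be applied to the model, which fits the batched update cycle better than re-ingesting everything. Matching statements about the same thing under different identifiers is a separate, harder step that this sketch does not attempt.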

With an actively managed source of information, like Symplectic Elements, this is easier than it sounds since much of the disambiguation has already been done by staff at the University as part of their Research Assessment Exercise (RAE). However, its graph of knowledge may still contain flailing ends, like external voluntary organisations known only by their title, hopefully not spelt in 101 ways. Obviously Vivo can be fed by any trusted datasource including manual input.

What is the net result of this? The University running Symplectic Elements for their internal administrative processes (RAE in the UK) is able, using OpenSource software (Vivo and the Symplectic Harvester), to publish a public view of its own Academic Graph. If, as is the aim of the Vivo project and the reason for its original grant, a sufficient number of Universities and research institutions deploy public instances, then Semantic Web aggregators will be able to index the linked data within each Vivo instance to build a Global collaborative network of researchers, their interests, their funding streams and their discoveries. When Google acquired Freebase and the developer knowledge behind that company, they acquired sufficient knowledge to do this on a global scale. I have heard rumors that that is what those individuals have been up to, which is why they went quiet.

That was why the Semantic Web was created for researchers in CERN all those years ago, wasn’t it?

Rogue Gadgets

5 01 2012

I have long thought one of the problems with OpenSocial is its openness to enable any Gadget based app anywhere. Even if there is a technical solution to the problem of a rogue App in the browser sandbox afforded by the iframe, that simply defers the issue. Sure, the Gadget code that is the App can’t escape the iframe sandbox and interfere with the in-browser container or other iframe-hosted apps from the same source. Unfortunately wonderful technical solutions are of little interest to a user whose user experience is impacted by the safe but rogue app. The app may be technically well behaved, but downright offensive and inappropriate on many other levels, and this is an area which has given many institutions food for thought when considering gadget based platforms like Google Apps for Education. A survey of the gadgets that a user could deploy via an open gadget rendering endpoint reveals that many violate internal policies of most organizations. Racial and sexual equality are often compromised. Even basic decency. It’s the openness of the gadget renderer that causes the problem: in many cases when deployed, it will render anything it’s given. It’s not hard to find gadgets providing porn in gmodules, the source of iGoogle, not exactly what an institution would want to endorse on its staff/student home pages.

For too long there has been an assumption that it’s the responsibility of the user to self police. That’s fine where the environment is offered by an organisation that can claim to be “only the messenger”, but when an environment is offered by an organization that is more than a messenger, self policing doesn’t hold water. The weakness of the OpenSocial gadget environment is its openness. It’s hard, if not impossible, to control what gadgets are available, which puts the onus on the container to control what is loaded.

Trusting Mobile Apps

There is a parallel to this problem in the mobile device industry, seen in the difference between Android and iOS. Android is open: the environment allows developers to do almost anything they like and have full access to all features of the phone. The Android Market, with over 400K apps on it, is often reported as being “wild west”, to quote: “…Unlike Apple’s strict approval policy, the Android Market is seen a little like the Wild West of the mobile, with many applications getting through which would never make the cut on iOS….”. That leaves the user with plenty of choice but exposed to a lot of risk. It’s spawning an industry of FUD, based on real fears and dangers, generating a new revenue stream for those that profited from virus and malware explosions on PCs. This time it’s a mobile device where the user may have placed far more trust in the device than they know (money, bank details, authentication, liability), and has far less ability to do anything about it (there I go, adding to the FUD).

Don’t get me wrong, as a developer, I don’t like the iOS approval process, but I think it’s a necessary evil to ensure that those providing the market place or store know that what they are pushing onto the unsuspecting public won’t do harm. Firstly the iOS platform protects the device from the rogue developer. Secondly the approval process ensures that the app conforms to the guidelines, not eating the battery or using up all the user’s monthly bandwidth allowance in a day. Thirdly, although not always the case, the approval process ensures that the soft factors of the app are acceptable. I haven’t tried, but I suspect an app that worked as a terrorist bomb trigger, and gave step by step instructions how to do it, would not pass the soft factors inspection. Consequently users of the iOS platform feel that they can trust the apps they are being sold. There is no aftermarket industry in end-user protection as there is no business case to support it.

In the Gadget environment, it’s the gadget renderer that is the equivalent to the store. By rendering a gadget, the renderer is not just a “messenger” not to be blamed, it’s saying something about what it’s rendering. If the gadget renderer doesn’t do that, then I have to argue that you should not trust the gadget rendered. It could be pushing anything at you. You might trust it, but if it doesn’t trust what it’s sending you, how can you trust what it sends? Would you accept a package from a person in a uniform before boarding a plane, just because the uniform had a badge with the word “security” on it? No, neither would I. If they had a gun and ID, I would still ask them why I should be trusted to carry it.

OCLC WorldShare

There are some OpenSocial gadget renderers that care about their reputation. Most Libraries are considered to be trusted sources of information, and OCLC, with a membership of 72,000 libraries, museums and archives in 170 countries, has a reputation it and its membership cares about. OCLC recently launched WorldShare, an OpenSocial based platform that uses Apache Shindig to render Gadgets and provide access for those gadgets to a wealth of additional information feeds. It does not provide the container in which to mount the Gadgets but it provides a trusted and respected source of rendered Gadgets. This turns the OpenSocial model on its head: a not for profit organisation delivering access to vast stores of information via OpenSocial and the Gadget feeds. Suddenly the gadget rendered feed is the only thing that matters. The container could be provided by OCLC, but equally by members. OCLC has wisely decided to certify any gadget that it is prepared to serve. Like the iOS certification and approval process, WorldShare’s certification is based on technical and soft criteria. That process will hopefully ensure quality, add value and protect its users from the wild west. Just as we trust our libraries to truthfully hold and classify knowledge, I hope that WorldShare’s realisation that the vendor has a responsibility will give us all the confidence to continue to trust OCLC as a source.
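To make the idea of a renderer that only serves what it has certified concrete, here is a minimal Java sketch. It is purely illustrative: the CertifiedGadgetRegistry class and the example URLs are hypothetical, and nothing here is taken from Shindig's API or from how WorldShare actually implements certification.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical gate a renderer could consult before serving a gadget. */
final class CertifiedGadgetRegistry {
    private final Set<String> certifiedSpecUrls;

    CertifiedGadgetRegistry(Set<String> certifiedSpecUrls) {
        // Defensive copy so later changes to the caller's set cannot widen the allowlist.
        this.certifiedSpecUrls = Collections.unmodifiableSet(new HashSet<>(certifiedSpecUrls));
    }

    /** True only when this exact gadget spec URL has been certified; everything else is refused. */
    boolean mayRender(String gadgetSpecUrl) {
        return gadgetSpecUrl != null && certifiedSpecUrls.contains(gadgetSpecUrl);
    }
}
```

The point of the sketch is the default: an uncertified gadget is refused rather than rendered, which is the opposite of an open rendering endpoint.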

OSGi and SPI

13 12 2011

OSGi provides a nice simple model to build components in, and the classloader policies enable reasonably sophisticated isolation between packages and versions that make it possible to consider multiple versions of an API, and implementations of those APIs, within a single container. Where OSGi starts to become unstuck is for SPI or Service Provider Interfaces. It’s not so much the SPI that’s a problem, rather the implementation. SPIs normally allow a deployer to replace the internal implementation of some feature of a service. In Shindig there is an SPI for the various Social services that allows deployers to take Shindig’s implementation of OpenSocial and graft that implementation onto their existing Social graph. In other places the SPI might cover a lower level concept. Something as simple as storage. In almost all cases the SPI implementation needs some sort of access to the internals of the service that it is supporting, and that’s where the problem starts. In most of the models I have seen, OSGi bundles Export packages that represent the APIs they provide. Those APIs provide a communications conduit to the internal implementation of the services that the API describes without exposing the implementation. That allows the developer of the API to stabilise the API whilst allowing the implementation to evolve. The OSGi classloader policy gives that developer some certainty that well-behaved clients (i.e. the ones that don’t circumvent the OSGi classloader policies) won’t be binding to the internals of the implementation.

SPIs, by contrast, are part of the internal implementation. Exposing an SPI as an export from a bundle is one approach; however, it would allow any client to bind to the internal workings of the Service implementation, exposed as an API, and that would probably be a mistake. Normal, well-behaved clients could easily become clients of the SPI. That places additional, unwanted burdens on the SPI interface as it can no longer be fully trusted by the consumer of the SPI or its implementation.

A workable solution appears to be to use OSGi Fragment bundles that bind to a Fragment Host, the Service implementation bundle containing the SPI to be implemented. Fragment bundles are different from normal bundles in nature. It’s probably best to think of them as a jar that gets added to the classpath of the bundle identified as the Fragment Host on activation, so that the Fragment bundle’s contents become available to the Fragment Host’s classloader. Naturally there are some rules that need to be observed.
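What marks a jar as a Fragment is the Fragment-Host header in its manifest, naming the symbolic name of the host bundle. As a rough illustration, the following Java sketch builds such a manifest with the standard java.util.jar classes. The FragmentManifestSketch class, the bundle names and the version are hypothetical, and in practice the manifest would normally be produced by the build tooling rather than by hand like this.

```java
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Hypothetical helper that assembles the manifest headers of a Fragment bundle. */
final class FragmentManifestSketch {
    /** Builds the main manifest headers that mark a jar as a fragment of hostSymbolicName. */
    static Manifest fragmentManifest(String fragmentSymbolicName, String hostSymbolicName) {
        Manifest manifest = new Manifest();
        Attributes main = manifest.getMainAttributes();
        // Manifest-Version must be present or the manifest is written out empty.
        main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        main.putValue("Bundle-ManifestVersion", "2");
        main.putValue("Bundle-SymbolicName", fragmentSymbolicName);
        main.putValue("Bundle-Version", "1.0.0"); // placeholder version
        // The header that attaches this jar to the host bundle's classloader.
        main.putValue("Fragment-Host", hostSymbolicName);
        // Deliberately no Bundle-Activator: a Fragment cannot be activated.
        return manifest;
    }
}
```

Calling manifest.write(System.out) on the result prints the headers in the form they would take in META-INF/MANIFEST.MF.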

Unlike an OSGi bundle, a Fragment bundle can’t make any changes to imports and exports of the Fragment Host classloader. In fact, if the manifest of the fragment contains any Import-Package or Export-Package statements, the Fragment will not be bound to the Fragment Host. The Fragment can’t perform activation, and the fragment can’t provide classes in a package that already exists in the Fragment Host bundle, although it appears that a Fragment can provide unique resources in the same package location. This combination of restrictions cuts off almost all the possible routes for extension, converting the OSGi bundle from something that can be activated into a simple jar on the classloader’s search path.

There is one loophole that does appear to work. If the Fragment Host bundle specifies a Service-Component manifest entry that specifies a service component xml file that is not in the Fragment Host bundle, then that file can be provided by the Fragment bundle. If you are using the BND (or Felix Bundle plugin) tool to specify the Service-Component header, either explicitly or implicitly, you will find that your route is blocked. This tool checks that any file specified exists. If the file does not exist when the bundle is being built, BND refuses to generate the manifest. There may be some logic somewhere in that decision, but I haven’t found an official BND way of overriding the behaviour. The solution is to ask the BND tool to put an empty Service-Component manifest header in, then merge the manifest produced with some supplied headers when the jar is constructed. This allows you to build the bundle leveraging the analysis tools within BND and have a Service-Component header that contains non-existent service component xml files.
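The merge step itself is simple: take the manifest the tool generated and overlay the hand-supplied headers on top of it. The Java sketch below illustrates only that overlay, using the standard java.util.jar classes. It is not the build configuration used for SparseMap; the ManifestMergeSketch class is hypothetical, and how the merge is wired into a real build depends on the tooling.

```java
import java.util.Map;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Hypothetical helper that overlays supplied headers on a tool-generated manifest. */
final class ManifestMergeSketch {
    /**
     * Returns a copy of the generated manifest in which every supplied header
     * replaces the generated value of the same name; other headers are kept.
     */
    static Manifest merge(Manifest generated, Map<String, String> suppliedHeaders) {
        Manifest merged = new Manifest(generated); // copy, the input is left untouched
        Attributes main = merged.getMainAttributes();
        for (Map.Entry<String, String> header : suppliedHeaders.entrySet()) {
            main.putValue(header.getKey(), header.getValue());
        }
        return merged;
    }
}
```

With a generated manifest whose Service-Component header is empty and a supplied Service-Component value naming a file such as OSGI-INF/serviceComponent.xml, the merged manifest carries the supplied value while keeping all the import and export analysis the tool produced.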

On startup, if there is no Fragment bundle adding the extra service component xml file to the Fragment Host classloader, then an error is logged and loading continues. If the Fragment bundle provides the extra service component xml file, then it’s loaded by the standard Declarative Service Manager that comes with OSGi. In that xml file, the implementor of the SPI can specify the internal services that implement the SPI, and allow the services inside the Fragment Host to satisfy their references from those components. This way, a relatively simple OSGi Fragment bundle can be used to provide an SPI implementation that has access to the full Fragment Host bundle internal packages, avoiding exposing those SPI interfaces to all bundles.

In SparseMap, I am using this mechanism to provide storage drivers for several RDBMSs via JDBC based drivers and a handful of Column DBs (Cassandra, HBase, MongoDB). The JDBC based drivers simply contain SQL and DDL configuration as well as a simple declarative service and the relevant JDBC driver jar. This is because the JDBC driver implementation is part of the Fragment Host bundle, where it lies inactive. The ColumnDB Fragment bundles all contain the relevant implementation and client libraries to make the driver work. SparseMap was beginning to be a dumping ground for every dependency under the sun. Formalising a storage SPI and extracting implementations into SPI Fragment bundles has made SparseMap storage independently extensible without having to expose the SPI to all bundles.

This will be in the 1.4 release of SparseMap due in a few days. Those using SparseMap will have to ensure that the SPI Fragment bundle is present in the OSGi container when the SparseMap Fragment Host bundle becomes active. If it’s not present, the repository in SparseMap will fail to start and an error will be logged indicating that OSGI-INF/serviceComponent.xml is missing.