I’ve been doing some work recently on a Vivo Harvester for Symplectic Elements. Vivo, as its website says, aims to build a collaboration network for researchers. They say national; I think global. The project hopes to do this by collecting everything known about a researcher, including the links between researchers, and publishing that to the web. You could say it’s like Facebook or G+ profiles for academics, except that the data is derived from reliable sources. Humans are notoriously unreliable when talking about themselves, and academics may be the worst offenders. That means that much of the information within a Vivo profile, and in the Academic Graph that the underlying semantic web represents, has been reviewed: either because the links between individuals have been validated by co-authored research in peer-reviewed journals, or because the source of the information has been checked and validated.

If I compare the Academic Graph(tm) (just claimed tm over that if no one else has) with social graphs like OpenSocial, then I suspect that OpenSocial covers less than 10% of the ontology of an Academic Graph; certainly a quick look at the base Vivo ontology reveals a few hundred top-level object classes, each with tens of data or object properties. That creates some challenges for Vivo that have influenced its design. Unlike Twitter’s FlockDB, with a single property, “follow”, on each relationship, this is a fully linked data set. Again unlike Twitter, there is no intention that a single Vivo instance, even cloud deployed, could host all researchers on a global scale. It seems that one of the standard RDF stores (Jena+SDB) is capable of at least holding the data and driving the user interface. I say holding, because since the early versions of Vivo it has used a Solr/Lucene index to provide query performance; pure Jena queries would never keep pace with all the queries required by a live application.

This introduces a secondary characteristic of Vivo: the update cycle. Updates to Vivo, especially via a harvester, require batched index rebuilds and may require re-computation of the RDF inferences. That places Vivo firmly in the infrequent-update space, which is precisely what Solr was designed for. A production instance of Vivo uses an RDF store for URI resource reference into a rich Academic Graph, some of which is exposed through the UI. That RDF store populates views in Solr from which much of the UI is derived. Solr becomes a rich data index.
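To make that last point concrete, here is a minimal sketch of the flattening idea: triples about one resource are walked one hop out and collapsed into a single denormalized document, the shape of thing an index like Solr can serve without graph traversal. The `ex:` URIs, predicate names and one-hop walk are all illustrative assumptions, not Vivo’s real ontology or indexing code.

```python
# Minimal sketch: flatten RDF triples about one resource into a single
# indexable document. All URIs and predicates here are hypothetical.

TRIPLES = [
    ("ex:person1", "rdf:type", "ex:FacultyMember"),
    ("ex:person1", "ex:label", "Dr Jane Doe"),
    ("ex:person1", "ex:authorOf", "ex:paper7"),
    ("ex:paper7", "ex:label", "A Study of Linked Data"),
    ("ex:paper7", "ex:publishedIn", "ex:journal3"),
    ("ex:journal3", "ex:label", "Journal of Examples"),
]

def label(subject, triples):
    """Return the ex:label of a subject, if any."""
    for s, p, o in triples:
        if s == subject and p == "ex:label":
            return o
    return None

def flatten(subject, triples):
    """Walk one hop out from a subject and build a flat document for indexing."""
    doc = {"id": subject, "label": label(subject, triples)}
    publications = []
    for s, p, o in triples:
        if s == subject and p == "ex:authorOf":
            publications.append(label(o, triples))
    doc["publications"] = publications
    return doc

print(flatten("ex:person1", TRIPLES))
```

Answering a UI query from that one document is a dictionary lookup; answering it from the raw triples means a join per hop, which is the gap the Solr views paper over.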

A Vivo Harvester, of which the Symplectic Elements harvester is just one of many, harvests data from a trusted source and generates triples representing the data and its relationships. Some harvesters, like the PubMed harvester, do this in their own ontology, whereas others use the Vivo ontology. The process of harvesting is to crawl the APIs of the target source, recovering all information by following links. In the case of Symplectic Elements the API is an ATOM API, so parts of the Symplectic Harvester could be adapted to any ATOM-based feed. The harvested information is converted into statements about each resource to represent knowledge: X is a person, Y is a peer-reviewed publication, Z is a conference and A is a grant. Finally the huge bucket of statements is processed to identify statements about the same thing, and compared to the existing Vivo model of knowledge. Eventually, when all the links have been matched and duplicates disambiguated, the data can be ingested into the Vivo model.
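As a sketch of that conversion step, the snippet below turns one ATOM entry into N-Triples statements of exactly the “Y is a publication” shape. The feed payload and the `example.org` ontology URIs are made up for illustration; they are not Symplectic’s real feed format or Vivo’s real ontology.

```python
# Minimal sketch: one harvested ATOM entry becomes RDF statements.
# The payload and ontology URIs are illustrative, not the real ones.
import xml.etree.ElementTree as ET

ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://example.org/pub/42</id>
    <title>Linked Data in Practice</title>
    <author><name>Jane Doe</name></author>
  </entry>
</feed>"""

NS = {"a": "http://www.w3.org/2005/Atom"}

def entry_to_ntriples(entry):
    """Emit N-Triples lines typing and labelling one harvested entry."""
    uri = entry.findtext("a:id", namespaces=NS)
    title = entry.findtext("a:title", namespaces=NS)
    return [
        f"<{uri}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
        "<http://example.org/onto#Publication> .",
        f'<{uri}> <http://www.w3.org/2000/01/rdf-schema#label> "{title}" .',
    ]

feed = ET.fromstring(ATOM)
for entry in feed.findall("a:entry", NS):
    for line in entry_to_ntriples(entry):
        print(line)
```

A real harvester would follow the entry’s links to authors, journals and grants and emit statements for each, but the principle is the same: every harvested fact becomes a triple.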

With an actively managed source of information like Symplectic Elements this is easier than it sounds, since much of the disambiguation has already been done by staff at the University as part of their Research Assessment Exercise (RAE). However, its graph of knowledge may still contain loose ends, like external voluntary organisations known only by their title, hopefully not spelt in 101 different ways. Obviously Vivo can be fed by any trusted data source, including manual input.

What is the net result of this? A University running Symplectic Elements for its internal administrative processes (the RAE in the UK) is able, using open-source software (Vivo and the Symplectic Harvester), to publish a public view of its own Academic Graph. If, as is the aim of the Vivo project and the reason for its original grant, a sufficient number of universities and research institutions deploy public instances, then Semantic Web aggregators will be able to index the linked data within each Vivo instance to build a global collaborative network of researchers, their interests, their funding streams and their discoveries. When Google acquired Freebase, and the developer knowledge behind that company, it acquired enough expertise to do this on a global scale; I have heard rumours that this is what those individuals have been up to, which is why they went quiet.

That was why the Semantic Web was created for researchers at CERN all those years ago, wasn’t it?