Vivo Harvester for Symplectic Elements

25 04 2012

I’ve been doing some work recently on a Vivo Harvester for Symplectic Elements. Vivo, as its website says, aims to build a global collaboration network for researchers. They say National, I think Global. The project hopes to do this by collecting everything known about a researcher, including the links between researchers, and publishing that to the web. You could say it’s like Facebook or G+ profiles for academics, except that the data is derived from reliable sources. Humans are notoriously unreliable when talking about themselves, and academics may be the worst offenders. That means that much of the information within a Vivo profile, and in the Academic Graph that the underlying semantic web represents, has been reviewed: either because the links between individuals have been validated by co-authored research in peer reviewed journals, or because the source of the information has been checked and validated.

If I compare the Academic Graph(tm) (just claimed tm over that if no one else has) with Social Graphs like OpenSocial, then I suspect that OpenSocial covers less than 10% of the ontology of an Academic Graph, and certainly a quick look at the base Vivo ontology reveals a few hundred top level object classes, each with tens of data or object properties. That creates some challenges for Vivo that have influenced its design. Unlike Twitter’s FlockDB with 1 property to each relationship, “follow”, this is a fully linked data set. Again unlike Twitter, there is no intention that a single Vivo instance, even cloud deployed, could host all researchers on a global scale. It seems that one of the standard RDF stores (Jena+SDB) is capable of at least holding the data and driving the user interface. I say holding because, since the early versions, Vivo has used a Solr/Lucene index to provide query performance; pure Jena queries would never keep pace with all the queries required by a live application. This introduces a secondary characteristic of Vivo: the update cycle. Updates to Vivo, especially via a harvester, require batched index rebuilds and may require re-computation of the RDF inferences. That places Vivo firmly in the infrequent update space, which is precisely what Solr was designed for. A production instance of Vivo uses an RDF store for URI resource references into a rich Academic Graph, of which some is exposed through the UI. That RDF store populates views in Solr from which much of the UI is derived. Solr becomes a rich data index.

A Vivo Harvester, of which the Symplectic Elements harvester is just one of many, harvests data from a trusted source and generates triples representing the data and its relationships. Some Harvesters, like the PubMed harvester, do that in their own ontology, whereas other Harvesters use the Vivo Ontology. The process of harvesting is to crawl the APIs of the target source, recovering all information by following links. In the case of Symplectic Elements its API is an ATOM API, so parts of the Symplectic Harvester could be adapted to any ATOM based feed. The harvested information is converted into statements about each resource to represent knowledge: X is a person, Y is a peer reviewed publication, Z is a conference and A is a grant. Finally the huge bucket of statements is processed to identify statements about the same thing and compared to the existing Vivo model of knowledge. Eventually, when all the links have been matched and duplicates disambiguated, that data can be ingested into the Vivo model.
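To make the "feed in, statements out" step concrete, here is a minimal sketch in Python. It is not the Symplectic Harvester's code: the sample feed, the `example.org` URIs, the `EX_PUBLICATION` and `EX_RELATED` terms and the function names are all made up for illustration, and a real harvester would map onto the Vivo ontology and follow links across many feeds. It only shows ATOM entries being turned into triples, with statements about the same thing collapsing into one.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Hypothetical ontology terms, used only for this illustration.
EX_PUBLICATION = "http://example.org/ontology#Publication"
EX_RELATED = "http://example.org/ontology#relatedTo"

# Made-up feed: the same publication appears twice, as it might when
# it is reached by following links from two different researchers.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example publications</title>
  <entry>
    <id>http://example.org/publication/1</id>
    <title>A peer reviewed paper</title>
    <link rel="related" href="http://example.org/user/42"/>
  </entry>
  <entry>
    <id>http://example.org/publication/1</id>
    <title>A peer reviewed paper</title>
    <link rel="related" href="http://example.org/user/42"/>
  </entry>
</feed>"""


def entry_to_triples(entry):
    """Yield (subject, predicate, object) statements for one ATOM entry."""
    subject = entry.findtext(ATOM + "id")
    yield (subject, RDF_TYPE, EX_PUBLICATION)
    yield (subject, RDFS_LABEL, entry.findtext(ATOM + "title"))
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "related":
            yield (subject, EX_RELATED, link.get("href"))


def harvest(feed_xml):
    """Parse a feed and return the set of distinct triples it describes."""
    root = ET.fromstring(feed_xml)
    triples = set()  # a set, so identical statements are stored once
    for entry in root.findall(ATOM + "entry"):
        triples.update(entry_to_triples(entry))
    return triples


if __name__ == "__main__":
    for triple in sorted(harvest(SAMPLE_FEED)):
        print(triple)
```

Using a set only removes exact duplicates. The hard part described above, deciding that two differently spelt resources are the same thing, needs matching rules that this sketch does not attempt.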

With an actively managed source of information like Symplectic Elements, this is easier than it sounds, since much of the disambiguation has already been done by staff at the University as part of their Research Assessment Exercise (RAE). However, its graph of knowledge may still contain loose ends, like external voluntary organisations known only by their title, hopefully not spelt in 101 ways. Obviously Vivo can be fed by any trusted datasource, including manual input.

What is the net result of this? The University running Symplectic Elements for their internal administrative processes (RAE in the UK) is able, using OpenSource software (Vivo and the Symplectic Harvester), to publish a public view of its own Academic Graph. If, as is the aim of the Vivo project and the reason for its original grant, a sufficient number of Universities and research institutions deploy public instances, then Semantic Web aggregators will be able to index the linked data within each Vivo instance to build a Global collaborative network of researchers, their interests, their funding streams and their discoveries. When Google acquired Freebase and the developer knowledge behind that company, they acquired sufficient knowledge to do this on a global scale. I have heard rumors that this is what those individuals have been up to, which is why they went quiet.

That was why the Semantic Web was created for researchers in CERN all those years ago, wasn’t it?


10 04 2012

Those of you who have been watching my G+ feed will have noticed some videos being posted. Those videos are of the OAE 1.2 UI running on a server developed in Python using Django. I am getting an increasing stream of questions about what it is, hence this blog post.

What is PyOAE?

PyOAE is a re-implementation of the OAE server using Django and a fully relational database schema. I use PostgreSQL and/or SQLite3 but it would probably work on any RDBMS supported by Django. The implementation uses the OAE 1.2.0 UI code (the yet to be released 1.2.0 branch) as its specification and runs that UI unmodified.

Is it a port of Nakamura to Python?

It is not a port of Nakamura, the Java code base that is the official server of OAE, and it shares no code or concepts with it. It does not contain the SparseMap content system or a Python port of SparseMap.

When will it be released?

It is not, yet, functionally complete and so is not ready for release. When it is, I will release it.

Will it scale?

The short answer is I don’t have enough evidence to know at this stage. I have some circumstantial evidence and some hard evidence that suggests that there is no reason why a fully RDBMS model using Django might not scale.

Circumstantial: Django follows the same architectural model used by many LAMP based applications, such as WordPress and Wikipedia, which have been shown to be amenable to scaling. In the educational space MoodleRooms scaled Moodle to 1M concurrent users with the help of MySQL and others.

Circumstantial: Sakai CLE uses relational storage and scales to the levels required to support the institutions that want to use it.

Circumstantial: Django has been used for some large, high traffic websites.

Hard: I have loaded the database with 1M users and the response time shows no increase with respect to the number of users.

Hard: I have loaded the database with 50K users and 300K messages with no sign of increasing response time.

Hard: I have loaded the content store with 100K users and 200K content items and seen no increase in response time.

However, I have not load tested with 1000s of concurrent users.

Will it support multi tenancy?

Short answer is yes, in the same way that WordPress supports multi tenancy, but with the potential to support it in other ways.

Is it complex to deploy?

No, it’s simple and uses any of the deployment mechanisms recommended by Django.

Does it use Solr?

At the moment it does not, although I haven’t implemented free text searching on the bodies of uploaded content. All queries are relational SQL generated by Django’s ORM framework.

What areas does it cover?

Users, Groups, Content, Messaging, Activities and Connections are currently implemented and tested. Content Authoring is partially supported and I have yet to look at supporting World. It’s about 7K lines of Python including comments, covering about 160 REST endpoints, with all UI content being served directly from disk. There are about 40 tables in the database.

What areas have you made different?

Other than the obvious use of an RDBMS for storage there are 2 major differences that jump out. The first is that the content system is single instance, i.e. if you upload the same content item 100 times, it’s only stored once and referenced 100 times. The second is that the usernames are not visible in any HTTP response from the server, making it impossible to harvest usernames from an OAE instance. This version uses FERPA safe opaque IDs. Due to some hard coded parts of the UI I have not been able to do the same for Group names, which can be harvested.
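For anyone wondering what those two ideas look like in code, here is a rough sketch. It is not the PyOAE implementation, which stores everything in relational tables: the class names, the choice of SHA-256 as the content key and the use of random tokens as opaque IDs are all assumptions made for illustration.

```python
import hashlib
import secrets


class SingleInstanceStore:
    """Toy in-memory store that keeps one copy of each distinct body."""

    def __init__(self):
        self._bodies = {}      # sha256 hex digest -> bytes
        self._ref_counts = {}  # sha256 hex digest -> number of uploads

    def put(self, body: bytes) -> str:
        """Store the body if it is new and return the key to reference it by."""
        key = hashlib.sha256(body).hexdigest()
        self._bodies.setdefault(key, body)  # stored only on the first upload
        self._ref_counts[key] = self._ref_counts.get(key, 0) + 1
        return key

    def get(self, key: str) -> bytes:
        return self._bodies[key]

    def stored_bodies(self) -> int:
        return len(self._bodies)

    def references(self, key: str) -> int:
        return self._ref_counts.get(key, 0)


class OpaqueIds:
    """Maps usernames to random IDs so responses never carry the username."""

    def __init__(self):
        self._id_by_username = {}
        self._username_by_id = {}

    def id_for(self, username: str) -> str:
        """Return the stable opaque ID for a username, minting one if needed."""
        if username not in self._id_by_username:
            opaque = secrets.token_urlsafe(16)  # random, not derived from the name
            self._id_by_username[username] = opaque
            self._username_by_id[opaque] = username
        return self._id_by_username[username]

    def username_for(self, opaque: str) -> str:
        """Server-side lookup only; never sent back to the client."""
        return self._username_by_id[opaque]


if __name__ == "__main__":
    store = SingleInstanceStore()
    keys = {store.put(b"the same lecture notes") for _ in range(100)}
    key = keys.pop()
    print(store.stored_bodies(), store.references(key))  # 1 100
```

The point of making the ID random rather than a hash of the username is that it cannot be reversed or guessed from outside. Only the server's own mapping links the two.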

Could a NoSQL store be used?

If Django has an ORM adapter for the NoSQL store, then yes, although I haven’t tried, and I am not convinced it’s necessary for the volume of records I am seeing.

Could the same be done in Java?

Probably, but the pace of development would be considerably slower (Django development follows an edit-save-refresh pattern and requires far fewer lines of code than Java). Also, a mature framework that had solved all the problems Django has addressed would be needed to avoid the curse of all Java developers, the temptation to write their own framework.

Who is developing it?

I have been developing it in my spare time over the past 6 weeks. It represents about 10 days work so far.

Why did you start doing it?

In January 2012 I heard that several people whose opinion I respected had said Nakamura should be scrapped and re-written in favour of a fully relational model. I wanted to find out if they were correct. So far I have found nothing that says they are wrong. My only regret is that I didn’t hear them earlier and I wish I had tried this in October 2010.

Why Django?

I didn’t want to write yet another framework, and I don’t have enough free time to do this in Java. Django has proved to have everything I need, and Python has proved to be quick to develop in and plenty fast enough to run OAE.

Didn’t you try GAE?

I did try writing a Google App Engine backend for OAE, however I soon found two problems. The UI requirements for counting and querying relating to Groups were incompatible with BigStore. I also calculated a GAE hosted instance would rapidly breach the free GAE hosting limits due to the volume of UI requests. Those two factors combined to cause me to abandon a GAE based backend.


6 04 2012

The world wakes up to an OSX virus. News media jumps on the story, terrifying users that they might be infected. Even though the malware that users were tricked into installing may not be nice, it’s clear from looking at the removal procedure that, unlike on the Windows platform, where a virus normally buries itself deep within the inner workings of the OS, this trojan simply modified an XML file on disk and hence reveals its location. To be successful in doing that it would have had to persuade the user to give it elevated privileges, as the file is only writable by the root user. If it failed to do that it would have infected user space.

In spite of all the hype around this infection, the route of infection shows that the underlying OS is, in itself, secure, and so only as secure as the user who grants an installer elevated privileges. If, when you install software on your Mac, you are not prompted for an administrative password, go and find out why before something else quietly installs itself and steals your bank details.

The files involved are the plist file (Info.plist) for any browser, so when you look to see if you have been infected, don’t forget to check all the browsers you use, not just Safari and Firefox. Also check Chrome.

If you are wondering if other plists are secure, many are cryptographically signed with a private key belonging to Apple. Provided that key doesn’t leak undetected, those plists can’t easily be compromised. For anyone who is paranoid, the standard Unix tools like tripwire would protect any unsigned plists by flagging changes to them.
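The idea behind that kind of tool is simple enough to sketch in a few lines of Python. This is not tripwire and not a replacement for it: the function names are made up, and a real tool also protects its own baseline from tampering, which this does not. It only shows the principle of recording a hash of each file while you trust it and comparing later.

```python
import hashlib
from pathlib import Path


def digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(paths):
    """Record a baseline of path -> digest while the files are trusted."""
    return {str(p): digest(p) for p in paths}


def changed(baseline):
    """Return the baseline paths that are now missing or have new contents."""
    suspect = []
    for path, expected in baseline.items():
        p = Path(path)
        if not p.is_file() or digest(p) != expected:
            suspect.append(path)
    return sorted(suspect)
```

You would call `snapshot()` with the plists you care about on a machine you believe is clean, keep the result somewhere the malware cannot write to, and run `changed()` against it whenever you want to check.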