Monday 26 March 2012

Assembling the Pelagios 2 Infrastructure

A short update from Pelagios WP1, which is about assembling the data infrastructure behind our project. To some extent this builds directly on the Graph Explorer - our proof-of-concept visualization interface from Pelagios 1. But while we were able to reuse what we learned in the first stage of the project, we've also come to realize that what we had previously implemented - features, data model, speed and scalability - already falls short of what we now need. WP1 therefore starts off with a complete reappraisal and re-write of the central functionality. Details follow below, along with a rough outline of what's been done so far and what the next steps will be.

Consolidation

Given the need in Pelagios 1 to get a demonstration up and running quickly, the Graph Explorer ended up being something of a monolith. A key goal of the re-write has therefore been to modularize the codebase better, and to consolidate the core functionality into one software library that can be re-used more easily elsewhere. We're still finalizing and testing this library as partners deliver updates to their data, but the essentials are finished. There's a convenient programming API for working with the core Pelagios model primitives - Datasets, GeoAnnotations and Places - in your own software. Bindings to store Pelagios data in a graph database are included, but without the hard-wired dependency that existed in the Graph Explorer. Here the TinkerPop graph database abstraction framework has helped greatly: it decouples the data model from the implementation classes, reduces code size by eliminating much of the boilerplate code, and keeps things generic, i.e. the bindings should now be re-usable with a variety of graph databases (although some of the more advanced I/O and query functionality remains specific to Neo4j - our DB of choice for Pelagios).

Less Speed, More Memory Consumption

Or was this the other way round? Regarding our toolset to read Pelagios data into the system, we switched our underlying RDF parser from Jena to the OpenRDF Rio parser framework. This allows us to more directly hook into the RDF parsing lifecycle, and avoids the need to construct full RDF graphs in memory before we can actually work with the data. As a result, parsing is now faster and less memory intensive. (Credit goes to Arachne for letting us learn the hard way that datasets can be... LARGE.)
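To make the difference concrete, here is a minimal sketch of the streaming idea in Scala. It does not use the Rio API: the `StreamingTriples` object, the `Triple` class and the deliberately naive line parser are all made up for illustration. The point is only that each statement is handed to a callback as soon as it has been read, so no full graph is ever held in memory.

```scala
object StreamingTriples {
  // Hypothetical, simplified triple: all three terms are kept as raw strings.
  final case class Triple(subject: String, predicate: String, obj: String)

  // Naive parser for N-Triples-style lines of the form "<s> <p> <o> .".
  // Illustration only: no escaping, datatype or blank-node handling.
  def parseLine(line: String): Option[Triple] =
    line.trim.stripSuffix(".").trim.split("\\s+", 3) match {
      case Array(s, p, o) => Some(Triple(s, p, o))
      case _              => None
    }

  // Hands each triple to the handler as soon as its line has been read.
  // Blank lines and '#' comment lines are skipped; nothing is accumulated.
  def stream(lines: Iterator[String])(handler: Triple => Unit): Unit =
    lines
      .map(_.trim)
      .filter(l => l.nonEmpty && !l.startsWith("#"))
      .foreach(l => parseLine(l).foreach(handler))

  // Example consumer: counts statements with a given predicate while
  // keeping only a single counter in memory.
  def countPredicate(lines: Iterator[String], predicate: String): Int = {
    var n = 0
    stream(lines)(t => if (t.predicate == predicate) n += 1)
    n
  }
}
```

Fed with a lazy iterator (for example the one returned by `scala.io.Source.fromFile(...).getLines()`), memory use in this sketch stays roughly constant regardless of the size of the dump.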

Getting our Feet Wet with Scala

As with Pelagios 1, the technological basis for our server-side components is still the Java Virtual Machine. This time, however, we chose to go with Scala:

  • Scala's syntax is, in general, more compact than that of Java.
  • Scala's functional aspects and comprehensive features for dealing with collections and lists are a very good fit with the things we frequently do when handling Pelagios data. In those cases Scala almost always removes the need for explicit loops, and often achieves the same result with a single line of code.
  • Pattern matching is another nice feature; it has made our parser classes, in particular, much more concise.
  • Last but not least: someone once suggested that as a developer, you should learn at least one new programming language every year. Although I find that advice a little fierce, new languages definitely encourage you to think about the same problems in different ways!
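As a rough illustration of the collections and pattern matching points above, here is a small sketch. The `Place`, `GeoAnnotation`, `Uri` and `Literal` classes are simplified stand-ins invented for this example, not the actual classes of our library.

```scala
object AnnotationStats {
  // Hypothetical stand-ins for the Pelagios model; names and fields are assumptions.
  final case class Place(uri: String, label: String)
  final case class GeoAnnotation(target: String, place: Place)

  // A single expression instead of a loop over a mutable map:
  // the number of annotations per place label.
  def countByPlace(annotations: Seq[GeoAnnotation]): Map[String, Int] =
    annotations.groupBy(_.place.label).map { case (label, group) => label -> group.size }

  // Toy representation of a parsed RDF term, to show pattern matching.
  sealed trait Term
  final case class Uri(value: String) extends Term
  final case class Literal(value: String, lang: Option[String]) extends Term

  // One case per shape of term; the compiler warns if a case is missing.
  def describe(term: Term): String = term match {
    case Uri(v)                 => s"resource $v"
    case Literal(v, Some(lang)) => s"'$v' ($lang)"
    case Literal(v, None)       => s"'$v'"
  }
}
```

The Java equivalent of `countByPlace` would typically need a loop, a mutable map and a null check, which is the kind of boilerplate we were happy to lose.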

Next Steps

With our core library in place, we are now almost ready to replace the old Graph Explorer. While we are busy wrapping an HTTP frontend around our core library, our partners are already starting to make the most of our all-new, third Pelagios Principle: "Expose metadata about your dataset using the VoID vocabulary". But that's for another post!
