Using an existing website as a queryable low-cost LOD publishing interface

Abstract

Maintaining an Open Dataset comes at an extra recurring cost when it is published through a dedicated Web interface. As publishing a dataset publicly rarely yields a direct financial return, these extra costs need to be minimized. We therefore explore reusing existing infrastructure by enriching existing websites with Linked Data. In this demonstrator, we advised the data owner to annotate a digital heritage website with JSON-LD snippets, resulting in a dataset of more than three million triples that is now available and officially maintained. The website itself is paged, so Hydra partial collection view controls were added to the snippets. We then extended the modular query engine Comunica to follow page controls and extract data from HTML documents while querying. This way, a SPARQL or GraphQL query over multiple heterogeneous data sources can power automated data reuse. While the query performance on such an interface is visibly poor, it becomes easy to create composite data dumps. As a result of implementing these building blocks in Comunica, any paged collection and enriched HTML page becomes queryable by the query engine. This enables heterogeneous data interfaces to share functionality and become technically interoperable.

This is a print version of a paper first written for the Web. The Web version is available at https://github.com/brechtvdv/Article-Using-an-existing-website-as-a-queryable-low-cost-LOD-publishing-interface.

Introduction

The Flemish Institute for Audiovisual Archiving (VIAA) is a non-profit organization that preserves petabytes of valuable image, audio, and video files. These files and their accompanying metadata are covered by distinct licenses, but some can be made accessible under an Open Data license. One initiative is opening up historical newspapers from the First World War on the open platform hetarchief.be. In 2017, the raw data of these newspapers was published as a Linked Open Data [1] (LOD) dataset using the low-cost Triple Pattern Fragments [2] (TPF) interface. Although this interface is still accessible (http://linkeddatafragments-qas.viaa.be/), no updates from the website have been exported to the TPF interface due to the absence of automation.

Fig. 1: The famous newspaper ‘Wipers Times’ published on 26th February 1916 (source: hetarchief.be).

Maintaining an up-to-date LOD interface not only requires technical resources, but also poses an organizational challenge. Content editors often work in a separate environment such as a Content Management System (CMS) to update a website. The raw data gets exported from that system and published in a dedicated environment, leaving the source of truth with the CMS. The question arises whether the data can be published closer to the authoritative source in a more sustainable way.

Website maintainers currently use JSON-LD structured data snippets to attain better search result ranking and visualisation in search engines. These snippets are script tags inside an HTML webpage containing Linked Data in JSON format (JSON-LD) compliant with the Schema.org [3] data model. They enable not only search engine optimization (SEO), but also standard LOD publishing. To be aligned with the structured data guidelines of search engines, the data should be representative of the main content of the subject page, such as newspaper metadata. Having n subject pages leads to n Linked Data Fragments that need to be linked together through hypermedia controls [4] so the website can be consumed as a dataset.
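To illustrate, a newspaper page could carry a snippet like the hypothetical excerpt below; the title and date are taken from Fig. 1, but the exact markup and properties used on hetarchief.be may differ.

```html
<!-- Hypothetical excerpt of a newspaper page embedding Schema.org
     metadata as a JSON-LD snippet. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Newspaper",
  "name": "Wipers Times",
  "datePublished": "1916-02-26"
}
</script>
```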

Comunica [5] is a Linked Data user agent that can run federated queries over several heterogeneous Web APIs, such as data dumps, SPARQL endpoints, Linked Data documents and Triple Pattern Fragments. The engine has been developed to make it easy to plug in specific types of functionality as separate modules. Such modules can be added or removed depending on the configuration. As such, by looking for affordances in Web APIs, more intelligent user agents can be created.

First, we give a short background on the Comunica tool and Hydra partial collection views. We then describe how hetarchief.be is enriched with JSON-LD snippets. Next, we explain how we allow Comunica to query over this and other sources by adding two building blocks. After this, we demonstrate how an end-user who wants to further analyze this data, for instance in spreadsheet software, can create a custom data dump. The online version of this paper embeds this demo and can be tested live. Finally, we conclude the demonstrator with a discussion and perspective on future work.

Background

Comunica

Every piece of functionality in Comunica is implemented as a separate building block based on the actor programming model, where each actor can respond to a specific action. Actors that share the same functionality, but with different implementations, are grouped on a communication channel called a bus. Interaction between actors happens through a mediator, which wraps around a bus to obtain an action's result from a single actor. Which actor answers depends on the configuration of the mediator, e.g. a race mediator returns the response of the actor that replies the earliest.
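The actor/bus/mediator interplay can be sketched as follows. This is a minimal illustration under assumed names; the real framework lives in the @comunica packages and is considerably richer.

```javascript
// Minimal sketch of Comunica's actor model (illustrative names only).
class Bus {
  constructor() { this.actors = []; }
  subscribe(actor) { this.actors.push(actor); }
}

// A "race" mediator: run the action on every actor subscribed to the
// bus and take the result of whichever replies first.
class RaceMediator {
  constructor(bus) { this.bus = bus; }
  mediate(action) {
    return Promise.race(this.bus.actors.map((a) => a.run(action)));
  }
}

// Two actors offering the same functionality (resolving a next-page
// link) with different implementations and speeds.
const slowActor = {
  run: (action) => new Promise((resolve) =>
    setTimeout(() => resolve({ next: action.url + '?page=2', via: 'slow' }), 50)),
};
const fastActor = {
  run: (action) => Promise.resolve({ next: action.url + '?page=2', via: 'fast' }),
};

const bus = new Bus();
bus.subscribe(slowActor);
bus.subscribe(fastActor);
const mediator = new RaceMediator(bus);

mediator.mediate({ url: 'https://example.org/collection' })
  .then((result) => console.log(result.via)); // the fastest actor wins
```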

Hydra partial collection views

Open Data is filled with collections of members (hotel amenities, road works, etc.). Such related resources can be grouped as members of a collection using the Hydra vocabulary. When the number of members grows too large, data owners can fragment the collection into collection views. Each view represents a part of the collection, keeping the Web API responses lightweight. In the case of hetarchief.be, each newspaper HTTP document represents one view of the collection of newspapers. These views are linked together with partial collection view controls: previous, next, first and last. This allows a client to fetch all members of the collection.
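A client drains such a paged collection by repeatedly following the next control. The sketch below mocks the views in memory under assumed example URLs; a real client would dereference each URL over HTTP and parse the embedded controls.

```javascript
// Two mocked partial collection views, linked by a hydra:next control.
const views = {
  'https://example.org/newspapers?page=1':
    { members: ['paper-a', 'paper-b'], next: 'https://example.org/newspapers?page=2' },
  'https://example.org/newspapers?page=2':
    { members: ['paper-c'], next: null },
};

// Follow next links until the last view, collecting all members.
function fetchAllMembers(startUrl) {
  const members = [];
  let url = startUrl;
  while (url) {
    const view = views[url]; // stands in for an HTTP GET + JSON-LD parse
    members.push(...view.members);
    url = view.next;         // hydra:next; absent on the last view
  }
  return members;
}

console.log(fetchAllMembers('https://example.org/newspapers?page=1'));
// → [ 'paper-a', 'paper-b', 'paper-c' ]
```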

Implementation

hetarchief.be

Every newspaper webpage is annotated with JSON-LD snippets containing domain-specific metadata and hypermedia controls. The metadata is described using established vocabularies such as Dublin Core Terms (DCTerms), Friend of a Friend (FOAF) and Schema.org. The hypermedia controls are described using the Hydra vocabulary for hypermedia-driven Web APIs. Although hetarchief.be contains several human-readable hypermedia controls (a free-text search bar, search facets, pagination for every newspaper), only Hydra's partial collection view controls are implemented: hydra:next links to the next newspaper and hydra:previous to the previous one. An estimate of the number of triples on a page is also added using hydra:totalItems and void:triples. This helps user agents build more efficient query plans.

{
  "@context": "https://www.w3.org/ns/hydra/context.jsonld",
  "@id": "https://hetarchief.be/media/de-school-op-het-front-studiebladen-van-sursum-corda/CMEPpOVIRqYiVZSYd3Q3k8tL",
  "previous": "https://hetarchief.be/media/vrij-belgi%C3%AB/B1IVhaOMLFgCUGNJkVGuZH3S",
  "next": "https://hetarchief.be/media/vrij-belgi%C3%AB/J1cnCMfndMbBNrde9VxIyVpB",
  "totalItems": 50,
  "http://rdfs.org/ns/void#triples": 50
}

Fig. 2: Every newspaper describes its next and previous newspaper using Hydra partial collection view controls. This wires Linked Data Fragments together into a dataset.

In order to lower the barrier for automated Open Data reuse, all responses include the Cross-Origin Resource Sharing (CORS) header Access-Control-Allow-Origin: *. Not all metadata of a newspaper falls under an Open License. In the process of digitizing these newspapers, Optical Character Recognition (OCR) is applied. According to European copyright legislation, content remains reserved by default until 70 years after the death of the last author. This implies that these scanned texts cannot yet be published as Open Data. This is solved by publishing the OCR text in a separate Linked Data document covered by different terms of use. This also keeps the fragment size lean, as such a document easily measures 50 kilobytes for four newspaper pages.

Building blocks Comunica

To make Comunica work with hetarchief.be, two additional actors were needed. First, we needed a generic actor to support pagination over any kind of hypermedia interface. Second, an actor was needed to parse JSON-LD data snippets from HTML documents. We explain these two actors in more detail below.

BusRdfResolveHypermedia is a bus in Comunica that resolves hypermedia controls from sources. Currently, this bus only contains an actor that resolves controls for TPF interfaces. We added a new actor (ActorRdfResolveHypermediaNextPage) to this bus that returns a search form containing a next page link, and analogously for previous page links.

Comunica already supports parsing the most common Linked Data formats (Turtle, TriG, RDF/XML, JSON-LD…). However, no parser existed yet for extracting data snippets from HTML documents. We therefore added an actor (ActorRdfParseHtmlScript) for parsing such HTML documents. This intermediate parser searches for data snippets and forwards them to their respective RDF parser. In the case of a JSON-LD snippet, the body of a script tag <script type="application/ld+json"> is parsed by the JSON-LD parse actor.
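Conceptually, the snippet-extraction step looks like the sketch below. The function name is ours and the regular expression is a simplification; the actual actor uses a proper streaming HTML parser rather than a regex.

```javascript
// Find <script type="application/ld+json"> blocks in an HTML document
// and parse their bodies, as ActorRdfParseHtmlScript does conceptually
// before forwarding the data to the JSON-LD parse actor.
function extractJsonLdSnippets(html) {
  const pattern =
    /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/gi;
  const snippets = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    snippets.push(JSON.parse(match[1]));
  }
  return snippets;
}

const html = `
<html><head>
<script type="application/ld+json">
  { "@id": "https://example.org/page1", "next": "https://example.org/page2" }
</script>
</head><body>…</body></html>`;

console.log(extractJsonLdSnippets(html)[0].next);
// → 'https://example.org/page2'
```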

By adding these two actors to Comunica, we can now query over a paged collection that is declaratively described with data snippets. As federated querying comes out of the box with Comunica, this cultural heritage collection can now be queried together with other knowledge bases (e.g. Wikidata). For example, retrieving basic information such as the title and publication date from 17 newspaper pages takes 1.5 minutes until all results are retrieved. This is caused by the absence of indexes: all pages need to be examined before a complete answer can be given.
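A query for such basic information could look like the hypothetical example below; the exact predicates used on hetarchief.be may differ from the Schema.org ones assumed here.

```sparql
# Hypothetical query retrieving newspaper titles and publication dates.
PREFIX schema: <http://schema.org/>
SELECT ?newspaper ?name ?date
WHERE {
  ?newspaper schema:name ?name ;
             schema:datePublished ?date .
}
```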

In the next section, we demonstrate how SPARQL querying can be applied to extract a spreadsheet.

Demonstrator

This demonstrator shows that a non-technical user can create a data dump from the cultural heritage website hetarchief.be. More specifically, a spreadsheet can be extracted from embedded paged collection views using SPARQL querying. The application is written with the front-end playground CodePen (https://codepen.io/brechtvdv/pen/ebOzXB). A browser-compatible library of Comunica is built using a custom configuration that can be found on GitHub (https://github.com/brechtvdv/hetarchief-comunica) under an Open License.

Listing 1: A spreadsheet is generated by entering a URL of a newspaper from hetarchief.be.

First, a user inserts a URL of a hypermedia-enabled LOD interface. For example, a user can go to hetarchief.be and select a newspaper as starting point. After pressing Start downloading, Comunica fetches the document located at the given URL and follows the embedded pagination controls. During querying, the user receives feedback on the number of processed CSV rows and bytes. Next, the user can copy the CSV output to their clipboard and save it using spreadsheet software. Optionally, a SPARQL query can be configured to customize the desired output.
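The conversion from query results to CSV rows can be sketched as below. The bindings are mocked here (in the demo they stream from Comunica), and quoting via JSON.stringify is a simplification of full CSV escaping.

```javascript
// Turn SPARQL SELECT bindings into CSV text: one header row, then one
// quoted row per binding.
function bindingsToCsv(variables, bindings) {
  const header = variables.join(',');
  const rows = bindings.map((b) =>
    variables.map((v) => JSON.stringify(b[v] ?? '')).join(','));
  return [header, ...rows].join('\n');
}

// Mocked bindings for illustration (title and date from Fig. 1).
const csv = bindingsToCsv(
  ['name', 'date'],
  [{ name: 'Wipers Times', date: '1916-02-26' }]);
console.log(csv);
```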

See the Pen Download your website as a spreadsheet by Brecht Van de Vyvere (@brechtvdv) on CodePen.

Conclusion

Data owners can publish their Linked Open Data very cost-efficiently on their website with JSON-LD snippets. After the initial cost of adding this feature to their website, they can have an always up-to-date dataset with negligible maintenance costs. However, machine clients that query and harvest websites can introduce unforeseen spikes of activity. Data owners will need to extend their monitoring capabilities beyond human interaction (e.g. Google Analytics) and apply an HTTP caching strategy for stale resources.

Linked Data services with a higher maintenance cost (an HDT [6] file, a TPF interface…) can be created on top of JSON-LD snippets, but these would suffer from scalability problems: Optical Character Recognition (OCR) texts have bad compression rates and thus require gigabytes of disk space. With our solution, these OCR texts are published in a separate document, keeping the maintenance cost low while automated harvesting remains possible.

In future work, extending Comunica for harvesting Hydra collections would help organizations improve their collection management. These collections could be declared on the main page of their website, improving Open Data discoverability.