Integrating Wikidata and other linked data sources – Federated SPARQL queries

This blog post is about running federated SPARQL queries on Wikidata. A federated query is a special type of SPARQL query that runs on more than one SPARQL endpoint, giving access to multiple linked data resources in a single query. Below is a template of a federated query.

Structure of a federated query. It contains query patterns for both the local endpoint (green box) and a remote endpoint (blue box). The address of the remote SPARQL endpoint is expressed with the SERVICE keyword. (sourcecode)
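As a rough text version of that structure, the Python sketch below assembles a query with a local part and a remote part wrapped in a SERVICE block. The function name, the variable names, and all `example.org` addresses are placeholders invented for illustration; they are not taken from the original template.

```python
def build_federated_query(variables, local_patterns, remote_endpoint, remote_patterns):
    """Assemble a SELECT query with one SERVICE block for the remote endpoint."""
    select = " ".join(f"?{name}" for name in variables)
    # Patterns outside SERVICE are evaluated by the endpoint that receives the query.
    local = "\n".join(f"  {pattern}" for pattern in local_patterns)
    # Patterns inside SERVICE are sent to the remote endpoint.
    remote = "\n".join(f"    {pattern}" for pattern in remote_patterns)
    return (
        f"SELECT {select} WHERE {{\n"
        f"{local}\n"
        f"  SERVICE <{remote_endpoint}> {{\n"
        f"{remote}\n"
        f"  }}\n"
        f"}}"
    )


# Placeholder IRIs only; replace them with real properties and a real endpoint.
template_query = build_federated_query(
    variables=["localItem", "remoteValue"],
    local_patterns=["?localItem <http://example.org/sharedKey> ?key ."],
    remote_endpoint="http://example.org/sparql",
    remote_patterns=[
        "?remoteItem <http://example.org/sharedKey> ?key .",
        "?remoteItem <http://example.org/value> ?remoteValue .",
    ],
)
print(template_query)
```

The shared variable `?key` is what joins the two parts: it is bound by the local patterns and reused inside the SERVICE block.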

The Wikidata Query Service (WDQS) now supports federated SPARQL queries on a limited number of endpoints. Remote SPARQL endpoints can be added to the list of supported endpoints by nomination.

Wikidata aims to provide the sum of all knowledge to the world at large. To fulfil this aim, it needs to be a hub in the total knowledge space, because technical, legal, or social limitations prevent including everything in a single repository. Through federation, or distributed querying, local and remote data can be combined. Here we explore three ways to apply this type of querying to Wikidata content.

 

From Wikidata to an external SPARQL endpoint (WikiPathways)

The query in this section uses federation to integrate a pathway from WikiPathways with Wikidata. Wikidata contains items on human pathways from WikiPathways, but metabolic interactions are not yet captured in Wikidata. Through federation, these metabolic interactions can be obtained. In the reverse direction, it is possible to obtain properties of pathway elements from Wikidata. Take for example the “Sudden Infant Death Syndrome (SIDS) Susceptibility Pathways (Homo sapiens)” pathway, which contains various biological interactions. Using federated queries, we can get properties such as the mass of a given pathway element.

“Sudden Infant Death Syndrome (SIDS) Susceptibility Pathways (Homo sapiens)” pathway (http://www.wikipathways.org/index.php/Pathway:WP706)

Using this pathway as input, a federated query can enrich it with properties not captured in WikiPathways. One example is the following query, which takes interactions from the above pathway and combines them with the mass of the individual pathway parts.
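A query of this kind might look roughly like the sketch below, written as a Python function that returns the SPARQL text to paste into the WDQS. It is not the original query from this post. The endpoint address and the WikiPathways vocabulary terms (`wp:Interaction`, `wp:participants`, `wp:bdbWikidata`) are assumptions that should be checked against the WikiPathways documentation. The Wikidata properties used are P2410 (WikiPathways ID), P2888 (exact match) and P2067 (mass). The `wdt:` prefix is left undeclared because the WDQS predefines it.

```python
import re

# Assumed endpoint address; check the WDQS list of supported federation endpoints.
WIKIPATHWAYS_ENDPOINT = "http://sparql.wikipathways.org/"


def pathway_mass_query(pathway_id):
    """Return a federated query (to run on the WDQS) for one WikiPathways pathway."""
    # Only accept identifiers such as "WP706" so nothing else ends up in the query.
    if not re.fullmatch(r"WP\d+", pathway_id):
        raise ValueError(f"unexpected WikiPathways identifier: {pathway_id!r}")
    return f"""\
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?part1 ?mass1 ?part2 ?mass2 WHERE {{
  # Local part (Wikidata): look up the pathway item by its WikiPathways ID.
  ?pathwayItem wdt:P2410 "{pathway_id}" ;
               wdt:P2888 ?pathwayIri .
  # Remote part (WikiPathways): interactions within that pathway.
  # The wp: terms below are assumptions about the WikiPathways vocabulary.
  SERVICE <{WIKIPATHWAYS_ENDPOINT}> {{
    ?pathway dc:identifier ?pathwayIri .
    ?interaction a wp:Interaction ;
                 dcterms:isPartOf ?pathway ;
                 wp:participants ?node1, ?node2 .
    FILTER (?node1 != ?node2)
    ?node1 wp:bdbWikidata ?part1 .
    ?node2 wp:bdbWikidata ?part2 .
  }}
  # Back on Wikidata: the mass (P2067) of each interacting part.
  ?part1 wdt:P2067 ?mass1 .
  ?part2 wdt:P2067 ?mass2 .
}}
"""


sids_query = pathway_mass_query("WP706")
print(sids_query)
```

The pathway identifier is the only parameter, so the same sketch could be reused for other pathways that have a Wikidata item.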

You can run this query here or watch it run on YouTube.

From a remote SPARQL endpoint to Wikidata

If a remote SPARQL endpoint is not (yet) eligible to be used in the WDQS, it is possible to run the query from the external endpoint instead, provided that the external endpoint accepts federated SPARQL queries. The SPARQL endpoint of UniProt is a nice example. UniProt includes many more properties for proteins than are currently captured in Wikidata. These are properties that, due to the more restrictive nature of the applicable license, can’t be included in Wikidata. The following federated SPARQL query runs on the SPARQL endpoint of UniProt. It selects all human UniProt entries that have a sequence variant leading to a loss of function and that also physically interact with a drug used as an enzyme inhibitor.
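The sketch below shows how such a query could be put together. It is not the original query. It assumes that loss of function is mentioned in the free-text comment of a natural-variant annotation, and that Wikidata marks enzyme inhibitors with P2868 (subject has role). The item for "enzyme inhibitor" is deliberately left as a parameter, and the value used in the usage line is a placeholder, not a real Wikidata item. P352 (UniProt protein ID) and P129 (physically interacts with) are existing Wikidata properties.

```python
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def loss_of_function_drug_query(inhibitor_role_iri):
    """Return a federated query to submit at the UniProt SPARQL endpoint."""
    return f"""\
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT DISTINCT ?protein ?variantComment ?drug WHERE {{
  # Local part (UniProt): human entries with a natural-variant annotation
  # whose free-text comment mentions a loss of function (an assumption).
  ?protein a up:Protein ;
           up:organism taxon:9606 ;
           up:annotation ?annotation .
  ?annotation a up:Natural_Variant_Annotation ;
              rdfs:comment ?variantComment .
  FILTER (CONTAINS(LCASE(?variantComment), "loss of function"))
  # Wikidata stores the bare accession (P352), so strip the IRI prefix.
  BIND (STRAFTER(STR(?protein), "http://purl.uniprot.org/uniprot/") AS ?accession)
  # Remote part (Wikidata): drugs that physically interact (P129) with the protein
  # and carry the given role (P2868); this modelling is an assumption.
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?proteinItem wdt:P352 ?accession .
    ?drug wdt:P129 ?proteinItem ;
          wdt:P2868 <{inhibitor_role_iri}> .
  }}
}}
"""


# Hypothetical placeholder; look up the real item for "enzyme inhibitor" on Wikidata.
ROLE_PLACEHOLDER = "http://www.wikidata.org/entity/Q_ENZYME_INHIBITOR"
uniprot_query = loss_of_function_drug_query(ROLE_PLACEHOLDER)
print(uniprot_query)
```

Note that the roles are reversed compared with the previous section: here UniProt is the local endpoint and Wikidata is the remote one inside the SERVICE block.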

Integrating Wikidata content with data from UniProt using a federated query submitted at http://sparql.uniprot.org

Try it…

 

From a local SPARQL endpoint to Wikidata (Data from a database that will never go into WD)

Another interesting use case where federation can be quite handy is local data. Epidemiological data on, for example, Zika outbreaks can contain large sets of measurements spread over multiple time frames. Loading those measurements into Wikidata can be difficult, especially if the outbreak is ongoing and new data arrive at short intervals. One solution for integrating such data with other resources like Wikidata is to run distributed queries from a local SPARQL endpoint. The local SPARQL endpoint has two roles: first, it collects the measurements from the different Zika studies; second, it executes federated queries that enrich these measurements with knowledge from Wikidata. We have created an example script that takes data on Zika outbreaks and converts it to linked data as RDF, which is then loaded into a local SPARQL endpoint. This prototype is available on GitHub.
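The two roles can be illustrated with a small sketch. It is not the prototype mentioned above. The `http://example.org/zika/` vocabulary, the function name, and the sample record are invented for illustration, and the case count is a made-up value, not real outbreak data. Locations are stored as Wikidata item IRIs (Q155 is Brazil) so that the federated query can join on them, and P1082 is the Wikidata property for population.

```python
EX = "http://example.org/zika/"  # hypothetical vocabulary for local measurements
XSD = "http://www.w3.org/2001/XMLSchema#"
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def measurement_to_ntriples(index, record):
    """Serialise one measurement as N-Triples lines for the local triple store."""
    subject = f"<{EX}measurement/{index}>"
    location, date, cases = record["location"], record["date"], record["cases"]
    return [
        f'{subject} <{EX}location> <{location}> .',
        f'{subject} <{EX}date> "{date}"^^<{XSD}date> .',
        f'{subject} <{EX}cases> "{cases}"^^<{XSD}integer> .',
    ]


# Made-up sample record; the case count is not a real measurement.
sample = {
    "location": "http://www.wikidata.org/entity/Q155",
    "date": "2016-02-01",
    "cases": 10,
}
triples = measurement_to_ntriples(0, sample)
print("\n".join(triples))

# Federated query to run on the LOCAL endpoint once the triples are loaded:
# the measurements stay local, the population (P1082) comes from Wikidata.
ENRICH_QUERY = f"""\
PREFIX ex: <{EX}>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?location ?date ?cases ?population WHERE {{
  ?measurement ex:location ?location ;
               ex:date ?date ;
               ex:cases ?cases .
  SERVICE <{WIKIDATA_ENDPOINT}> {{
    ?location wdt:P1082 ?population .
  }}
}}
"""
print(ENRICH_QUERY)
```

Using Wikidata IRIs directly as location identifiers is a design choice of this sketch; it avoids a separate mapping step at query time.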

This approach also works when one would like to integrate sensitive data (for example clinical patient data) with external Wikidata knowledge, provided the local endpoint is maintained within a secure infrastructure that allows getting data from outside the infrastructure but prevents exports.

 

Indeed, SPARQL has a steep learning curve.

Although writing SPARQL queries can seem quite intimidating, the ability to run federated queries on Wikidata content is very valuable and needed to make Wikidata the central hub of research data in the life sciences. The effort to learn SPARQL is worth it. Fortunately, Wikidata provides a large set of examples, either for inspiration or as learning material.
There are also several developments under way to make writing queries easier.

There is also an R package that integrates SPARQL with R scripts and can scrape the example queries from Wikidata. This means one can use the advantages SPARQL offers without writing a SPARQL query, simply by building on what others have already made.
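The same idea of reusing an existing query from a script can be sketched in Python with the standard library only. This is not the R package mentioned above. The function name and the user-agent string are placeholders, and the query is the common "instances of house cat" example (P31 is instance of, Q146 is house cat). The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_sparql_request(query, endpoint=WDQS_ENDPOINT,
                         user_agent="example-sparql-script/0.1"):
    """Build (but do not send) a GET request for a SPARQL SELECT query."""
    # The query text travels URL-encoded in the "query" parameter.
    params = urlencode({"query": query})
    headers = {
        # Ask for the standard SPARQL JSON results format.
        "Accept": "application/sparql-results+json",
        # Placeholder value; a descriptive user agent helps endpoint operators.
        "User-Agent": user_agent,
    }
    return Request(f"{endpoint}?{params}", headers=headers)


request = build_sparql_request("SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 3")
print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` and parsing the response with `json.load` would then give the bindings as ordinary Python dictionaries.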

Finally, there is always the help of the Twitter space. Many are quite eager to share SPARQL knowledge.

