Lexical Linked Data Case Study: ALPINO Treebank – Part 2

Following on from the previous post, we will now create a SPARQL endpoint so that we can query the contents of the data. To do this we will use the lightweight engine 4store. The first task is to install it; on an Ubuntu-based machine this is simply achieved with

sudo apt-get install 4store

Otherwise it may be necessary to install it following the instructions here.

Once 4store is installed, we create a database, start the back-end and load all the data:

4s-backend-setup alpino
4s-backend alpino
for file in $(find . -name '*.rdf')
do
  # Strip the leading "./" and the file extension to get the path part of the graph URI
  fileBase=$(echo "$file" | sed 's/\.\/\(.*\)\..*/\1/')
  4s-import alpino -v -a -m "http://lexinfo.net/corpora/alpino/$fileBase" "$file"
done

Note that the RDF files produced by the XSLT do not specify their own URI, so we must be careful when loading the data that 4store uses the right URI for each file. This is what the -m option in the loop above is for.
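To make the URI mapping explicit, here is a minimal sketch of the same path-to-URI step using shell parameter expansion instead of sed. The function name graph_uri_for and the BASE variable are my own names for illustration; only the base URL comes from the import loop above.

```shell
# Hypothetical helper: map a path as printed by `find .` to its graph URI.
# BASE is the prefix used in the import loop above.
BASE="http://lexinfo.net/corpora/alpino"

graph_uri_for() {
  path=${1#./}      # drop the leading "./" that find prints
  path=${path%.*}   # drop the final extension (e.g. ".rdf")
  printf '%s/%s\n' "$BASE" "$path"
}
```

With this, ./cgn_exs/1.rdf maps to the extensionless URI under /corpora/alpino/cgn_exs/, which is the same name that content negotiation serves in Part 1.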

Next we set up the HTTP server on an arbitrary (firewalled) port:

4s-httpd alpino -p 8888
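Before putting anything in front of it, it can help to check that the server answers queries locally. The following is a rough sketch, assuming curl is installed; the function name run_query and the ENDPOINT and CURL variables are hypothetical names of mine, and the URL is simply the local port chosen above with the /sparql/ path.

```shell
# Hypothetical helper: send a SPARQL query to the local 4store HTTP server.
# ENDPOINT assumes the port chosen above; set CURL=echo for a dry run
# that prints the curl arguments instead of contacting the server.
ENDPOINT="http://localhost:8888/sparql/"

run_query() {
  # -G makes curl append the --data-urlencode pair to the URL as a
  # URL-encoded GET parameter instead of sending a POST body.
  ${CURL:-curl} -s -G --data-urlencode "query=$1" "$ENDPOINT"
}
```

For example, run_query 'SELECT * WHERE { ?s ?p ?o } LIMIT 10' should return a handful of triples if the import succeeded.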

Now we need to make it available on the web. We will do this through a PHP script, as the default HTTP interface of 4store is not particularly user-friendly.

I wrote the following PHP script for this:

<?php
if(!isset($_REQUEST["query"])) { ?>
<html>
 <head>
 <title>ALPINO corpus query</title>
 </head>
 <body>
 <form action="" method="get">
 <label for="query">Query:</label><br/>
 <textarea name="query" rows="5" cols="80">
PREFIX cat: <http://lexinfo.net/corpora/alpino/categories#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE { ?s ?p ?o } LIMIT 10
</textarea><br/>
 <input type="submit"/>
 </form>
 </body>
</html>
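<?php } else {
// Forward the query to the local 4store HTTP server and relay the response
$ch = curl_init();
$url = "http://localhost:8888/sparql/?query=" . urlencode($_REQUEST["query"]);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if($code == 200) {
 header("Content-type: application/sparql-results+xml");
 echo $data;
} else {
 // Pass the error status on to the client instead of answering 200
 http_response_code($code ? $code : 502);
 header("Content-type: text/plain");
 echo $data;
}
}
?>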

Now the final step is to register the resource with CKAN. To do this we simply go to the website, create a user account and fill in the form thus:

In particular we added the following URLs

Finally we send a mail to the open linguistics list to announce the resource to the Open Linguistics Working Group.

Lexical Linked Data Case Study: ALPINO Treebank

In this post, I will detail how to publish a linguistic resource as linked data from scratch. These instructions are based on a Linux server with apache2, but should apply to other server types as well. As a case study we take the ALPINO Treebank, a treebank for Dutch that is available in XML and released under the GPLv2; hence we can republish it in RDF as long as we attribute the original authors.

We will start by obtaining the resource, decompressing it and removing the non-data folders

wget http://www.let.rug.nl/~vannoord/ftp/AlpinoCDROM/AlpinoCDROM.tgz
tar xzvf AlpinoCDROM.tgz
rm -fr Clig/ Papers/ stylesheets/ thistle-2-0-1/ xmlmatch/

Next, we do a simple RDF conversion, starting from the simple XSLT stylesheet available at http://www.gac-grid.de/project-products/Software/XML2RDF.html, which we apply with the xsltproc command:

for file in *.xml
do xsltproc xml2rdf3.xsl "$file" > "$file.rdf"
done
rename .xml.rdf .rdf *.xml.rdf
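One caveat: the rename call above uses the "from to files" syntax of the util-linux tool, whereas on Debian and Ubuntu the rename command is usually a Perl script that expects a Perl expression instead. A sketch that avoids rename altogether, and also descends into subdirectories, might look like the following; rdf_name_for and convert_tree are hypothetical names of my own.

```shell
# Hypothetical helper: derive the output name by swapping .xml for .rdf.
rdf_name_for() {
  printf '%s\n' "${1%.xml}.rdf"
}

# Hypothetical helper: convert every .xml file below a directory.
# $1 is the directory, $2 the XSLT stylesheet (e.g. xml2rdf3.xsl).
convert_tree() {
  find "$1" -type f -name '*.xml' | while IFS= read -r xml; do
    xsltproc "$2" "$xml" > "$(rdf_name_for "$xml")"
  done
}
```

It would be called as convert_tree . xml2rdf3.xsl from the directory containing the treebank.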

Now we simply create a new folder on our apache2 server and copy the result there

cd /var/www/lexinfo.net/htdocs/
mkdir -p corpora/alpino
cp -r ~/AlpinoCDROM/* corpora/alpino/
chown -R apache:apache corpora/

And now we see that the data is available

In fact, a linked data server is just a normal web server that returns RDF data. We make a quick modification to the MIME types to make sure it returns the correct type, in /etc/apache2/modules.d/00_mod_mime.conf (on my server; check your Linux distro's documentation), and then restart the server.

AddType application/rdf+xml .rdf
AddType text/turtle .ttl

# For type maps (negotiated resources):
AddHandler type-map var

We can check this works very simply as follows

jmccrae@greententacle ~/AlpinoCDROM $ curl -I -H "Accept: application/rdf+xml" http://lexinfo.net/corpora/alpino/cgn_exs/1.rdf
HTTP/1.1 200 OK
Date: Thu, 09 Aug 2012 19:13:58 GMT
Server: Apache
Last-Modified: Thu, 09 Aug 2012 19:12:38 GMT
ETag: "1c96013-6e9-4c6da02bdb580"
Accept-Ranges: bytes
Content-Length: 1769
Cache-Control: max-age=1209600
Expires: Thu, 23 Aug 2012 19:13:58 GMT
Content-Type: application/rdf+xml

So far, so good… The next step is to enable content negotiation. For Alpino we have the issue that the raw XML files are left without an extension, so we move all of these files to the extension .txt. Then in each directory we create a file called .htaccess and add the following line to it.

Options +MultiViews
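Doing this by hand for every directory is tedious, so here is a rough sketch of how both steps could be scripted. The function name prepare_negotiation is hypothetical, and the sketch assumes that every file without a dot in its name is one of the raw files to be moved. Note also that Apache only honours Options in .htaccess if AllowOverride permits it for that directory.

```shell
# Hypothetical helper: prepare a directory tree for MultiViews negotiation.
# $1 is the root of the published data.
prepare_negotiation() {
  root=$1
  # Step 1: give every extensionless file the extension .txt
  find "$root" -type f ! -name '*.*' | while IFS= read -r f; do
    mv "$f" "$f.txt"
  done
  # Step 2: enable MultiViews in every directory, without duplicating
  # the line if an .htaccess file already contains it
  find "$root" -type d | while IFS= read -r d; do
    grep -qsF 'Options +MultiViews' "$d/.htaccess" ||
      printf 'Options +MultiViews\n' >> "$d/.htaccess"
  done
}
```

On the server used in this post it would be run as prepare_negotiation /var/www/lexinfo.net/htdocs/corpora/alpino.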

Now we test it:

jmccrae@greententacle ~ $ curl -I -H "Accept: application/rdf+xml" http://lexinfo.net/corpora/alpino/cgn_exs/1
HTTP/1.1 200 OK
Date: Thu, 09 Aug 2012 19:38:08 GMT
Server: Apache
Content-Location: 1.rdf
Vary: negotiate,accept
TCN: choice
Last-Modified: Thu, 09 Aug 2012 19:12:38 GMT
ETag: "1c96013-6e9-4c6da02bdb580;4c6da48d60b80"
Accept-Ranges: bytes
Content-Length: 1769
Cache-Control: max-age=1209600
Expires: Thu, 23 Aug 2012 19:38:08 GMT
Content-Type: application/rdf+xml
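The header to look at here is Content-Type (together with Content-Location, which shows the file that was chosen). If you want to repeat the check for several Accept values, a small filter like the following can pull out just that header. This is only a sketch, and content_type_of is a hypothetical name of mine.

```shell
# Hypothetical helper: read HTTP response headers on standard input and
# print the value of the Content-Type header.
content_type_of() {
  # HTTP header lines end in CRLF, so strip the carriage returns first;
  # header names are case-insensitive, hence tolower().
  tr -d '\r' | awk -F': ' 'tolower($1) == "content-type" { print $2; exit }'
}
```

It would be used as: curl -sI -H "Accept: application/rdf+xml" http://lexinfo.net/corpora/alpino/cgn_exs/1 | content_type_of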

It works… Now to link it to something. Inspecting the data, there are three clear groups of categories in the corpus: “cat” for categories/phrase types, “rel” for dependency relations and “pos” for part-of-speech tags. Many of these can be aligned to a data category registry or linguistic ontology. I chose to provide alignments to ISOcat and to LexInfo. This was done by creating an OWL ontology to describe the categories used in the resource; for example, the following describes “adverbs” in ALPINO:

<owl:NamedIndividual rdf:about="http://lexinfo.net/corpora/alpino/categories#adv">
   <rdf:type rdf:resource="http://lexinfo.net/corpora/alpino/categories#PartOfSpeech"/>
   <rdfs:label xml:lang="en">Adverb</rdfs:label>
   <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-1232"/>
   <owl:sameAs rdf:resource="&lexinfo;adverb"/>
</owl:NamedIndividual>

Finally we modify the XSLT to use these new categories. In particular, we modify the script at line 105 so that it generates a triple with a URI object, as follows:

<xsl:choose>
  <xsl:when test="name()='rel'">
    <cat:rel>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:rel>
  </xsl:when>
  <xsl:when test="name()='cat'">
    <cat:cat>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:cat>
  </xsl:when>
  <xsl:when test="name()='pos'">
    <cat:pos>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:pos>
  </xsl:when>
  <xsl:otherwise>
    <xsl:element name="{name()}" namespace="{$ns}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:otherwise>
</xsl:choose>

We apply this, publish the result together with our ontology, and have the first version of our linked data corpus.

Finally, I make the resource browsable by creating a zipped dump of all the data and a new index page
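For the dump itself, one possible approach is sketched below. It assumes a gzipped tar archive is acceptable as the "zipped" format and that only the .rdf files should go into it; the function name make_dump is hypothetical, and the -T option for reading file names from standard input is supported by GNU tar and bsdtar but not necessarily by every tar.

```shell
# Hypothetical helper: pack all .rdf files below a directory into one archive.
# $1 is the data directory, $2 the archive file to write.
make_dump() {
  root=$1
  out=$2
  # Paths are listed relative to the root so the archive unpacks cleanly;
  # tar reads the file list from standard input via "-T -".
  ( cd "$root" && find . -type f -name '*.rdf' | sort | tar -czf - -T - ) > "$out"
}
```

For instance, make_dump corpora/alpino /tmp/alpino.tar.gz would write the archive outside the data directory, from where it can be moved into place.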

In Part 2, we set up a SPARQL endpoint and register the resource with CKAN.

Ontology File: http://lexinfo.net/corpora/alpino/categories.rdf

XSLT File: http://lexinfo.net/corpora/alpino/alpino_xml2rdf3.xsl