A simple regular expression for tokenization of (most) natural language

I often need to tokenize text and have generally relied on the following fairly simple regular expression to do the trick

string.split("\\s+")

\\s is the character class for (ASCII) whitespace, so while this works, in practice it quickly leads to problems. Let’s take an example bit of text:

So in “this test”, we wish to check tokenization; among other things. So…, we ask a question? And make a statement! (and maybe a note). I’ll check some other stuff, like we may have    exaggerated    spacing! Or strange quotes, like «en français» or „auf Deutsch“.
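Splitting this on whitespace (a quick hypothetical snippet of mine, not from the original recipe) already shows the problem:

String example = "So in “this test”, we wish to check tokenization; among other things.";
System.out.println(java.util.Arrays.toString(example.split("\\s+")));
// prints [So, in, “this, test”,, we, wish, to, check, tokenization;, among, other, things.]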

So we quickly have the issue that we get tokens like “this and test”, with the quotation marks glued on, which are not so good… instead we would like to have the quotation marks as tokens by themselves. Now there is a regular expression we could use:

string.split("\\b")

However, this creates its own issues… firstly, each space is now its own token… and compound punctuation doesn’t work, “like in quotes.” Even worse, contractions like “doesn’t” get split into three tokens. Not great… I present my solution:

string.replaceAll("(\\.\\.\\.+|[\\p{Po}\\p{Ps}\\p{Pe}\\p{Pi}\\p{Pf}\u2013\u2014\u2015&&[^'\\.]]|(?<!(\\.|\\.\\p{L}))\\.(?=[\\p{Z}\\p{Pf}\\p{Pe}]|\\Z)|(?<!\\p{L})'(?!\\p{L}))"," $1 ")
  .replaceAll("\\p{C}|^\\p{Z}+|\\p{Z}+$","")
  .split("\\p{Z}+")

Daunting, but I will attempt to explain it… the first line is where most of the magic happens… it is a regex group consisting of the following alternatives:

  1. \\.\\.\\.+ : Captures ellipses (three or more full stops).
  2. [\\p{Po}\\p{Ps}\\p{Pe}\\p{Pi}\\p{Pf}\u2013\u2014\u2015&&[^'\\.]] : Captures most single punctuation marks. We mostly use Unicode categories, in particular all “other” punctuation, start and end punctuation (brackets, braces, etc.), initial and final quotes, and long dashes. Finally, the intersection &&[^'\\.] removes two unwanted elements from the class, “.” and “'”, as these are handled by the next two alternatives.
  3. (?<!(\\.|\\.\\p{L}))\\.(?=[\\p{Z}\\p{Pf}\\p{Pe}]|\\Z) : This is for full stops. They are kind of hard, as we would like to avoid splitting “I.B.M.” and of course ellipses. First we use a zero-width look-behind assertion to check that the full stop is not preceded by another full stop, or by a full stop followed by a letter (as inside “I.B.M.”). Then we look ahead and check that the next character is a space, an end punctuation, a final quote, or the end of the string (that is \\Z).
  4. (?<!\\p{L})'(?!\\p{L}) : This finally matches any straight single quote that is not attached to a letter on either side… ’tis not always correct, but…

The replacement string is then simply whatever matched, with a space on either side. This generates some extra spaces, of course.

The next pass is quite simple… we remove all initial and trailing spaces, as well as any control characters… it is important to use the Unicode category \\p{Z} here, as \\s does not match the non-breaking space, which occurs quite often in some corpora (e.g., Wikipedia)… and which allows you to draw triforces, of course.

Finally, we split the text according to the Unicode spaces that are now in the text. This also eliminates all the extra spaces we created in the first step.
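Putting the three passes together as a runnable class (a minimal sketch; the class and method names are mine, not part of the original recipe):

import java.util.Arrays;

public class SimpleTokenizer {

    public static String[] tokenize(String text) {
        // Pass 1: put a space either side of ellipses, single punctuation marks,
        // sentence-final full stops and isolated single quotes
        return text.replaceAll("(\\.\\.\\.+|[\\p{Po}\\p{Ps}\\p{Pe}\\p{Pi}\\p{Pf}\u2013\u2014\u2015&&[^'\\.]]|(?<!(\\.|\\.\\p{L}))\\.(?=[\\p{Z}\\p{Pf}\\p{Pe}]|\\Z)|(?<!\\p{L})'(?!\\p{L}))", " $1 ")
                // Pass 2: drop control characters and leading/trailing spaces
                .replaceAll("\\p{C}|^\\p{Z}+|\\p{Z}+$", "")
                // Pass 3: split on Unicode spaces
                .split("\\p{Z}+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("I.B.M. made a statement (and a note).")));
        // prints [I.B.M., made, a, statement, (, and, a, note, ), .]
    }
}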

This tokenization is a little different from some of the more widely known schemes, such as the Penn Treebank method or Lucene’s. However, it does not change the original text other than its white-space, it is very easy to use, and it has no special rules for English, so it should work well on most languages (with some obvious exceptions such as Chinese, Japanese and Korean).

Lexical Linked Data Case Study: ALPINO Treebank – Part 2

Following on from the previous post, we will now create a SPARQL endpoint so that we can query the contents of the data. To do this we will use the lightweight engine 4store. The first task is to install it; on an Ubuntu-based machine this is simply achieved with

sudo apt-get install 4store

Otherwise it may be necessary to install it following the instructions on the 4store website.

Once 4store is installed we simply set up the back-end, start it and load all the data

4s-backend-setup alpino
4s-backend alpino
for file in `find . -name \*.rdf`
do
  fileBase=`echo $file | sed 's/\.\/\(.*\)\..*/\1/'`
  4s-import alpino -v -a -m "http://lexinfo.net/corpora/alpino/$fileBase" $file
done

Note that, as the RDF files produced by the XSLT do not specify their own URIs, we must be careful when loading the data that 4store assigns the right model URIs; this is what the -m option above does.

Next we set up the web connector on a random (firewalled) port

4s-httpd alpino -p 8888
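At this point we can already sanity-check the endpoint from code; the following minimal Java sketch (the class name is mine) issues the same kind of request that the PHP script below forwards:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SparqlCheck {
    public static void main(String[] args) throws Exception {
        String query = "SELECT * WHERE { ?s ?p ?o } LIMIT 10";
        // 4store answers SPARQL queries at /sparql/ on the port chosen above
        URL url = new URL("http://localhost:8888/sparql/?query="
                + URLEncoder.encode(query, "UTF-8"));
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null)
            System.out.println(line); // SPARQL results XML
        in.close();
    }
}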

Now we need to make it available to the web. We will do this through a PHP script, as the default HTTP interface for 4store is not particularly user-friendly.

I wrote the following PHP script for this:

<?php
if(!isset($_REQUEST["query"])) { ?>
<html>
 <head>
 <title>ALPINO corpus query</title>
 </head>
 <body>
 <form action="" method="get">
 <label for="query">Query:</label><br/>
 <textarea name="query" rows="5" cols="80">
PREFIX cat: <http://lexinfo.net/corpora/alpino/categories#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE { ?s ?p ?o } LIMIT 10
</textarea><br/>
 <input type="submit"/>
 </form>
 </body>
</html>
<?php } else {
// Forward the query to the local 4store SPARQL endpoint
$ch = curl_init();
$url = "http://localhost:8888/sparql/?query=" . urlencode($_REQUEST["query"]);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
$code = curl_getinfo($ch,CURLINFO_HTTP_CODE);
if($code == 200) {
 // Relay the SPARQL results XML with the correct MIME type
 header("Content-type: application/sparql-results+xml");
 echo $data;
} else {
 echo $data;
}
curl_close($ch);
}
?>

Now the final step is to register the resource with CKAN. To do this we simply go to the website, create a user account and fill in the dataset form, adding in particular the URLs at which the data can be accessed.

Finally, we send a mail to the open linguistics mailing list to announce the resource to the Open Linguistics Working Group.

Lexical Linked Data Case Study: ALPINO Treebank

In this post, I will detail how to publish a linguistic resource as linked data from scratch. These instructions are based on a Linux server with apache2, but should apply to other server types as well. As a case study we use the ALPINO Treebank, a treebank for Dutch distributed as XML and released under the GPLv2; hence we can republish it in RDF as long as we attribute the original authors.

We will start by obtaining the resource, decompressing it and removing the non-data folders

wget http://www.let.rug.nl/~vannoord/ftp/AlpinoCDROM/AlpinoCDROM.tgz
tar xzvf AlpinoCDROM.tgz
rm -fr Clig/ Papers/ stylesheets/ thistle-2-0-1/ xmlmatch/

Next, we do a simple RDF conversion, using the simple XML-to-RDF XSLT stylesheet available here: http://www.gac-grid.de/project-products/Software/XML2RDF.html. We will use the xsltproc command to apply it:

for file in *.xml
do xsltproc xml2rdf3.xsl $file >$file.rdf
done
rename .xml.rdf .rdf *.xml.rdf
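As an aside, the same conversion can be driven from Java with the standard javax.xml.transform API instead of xsltproc; a minimal sketch (the class name is mine, and it assumes xml2rdf3.xsl sits in the working directory):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Xml2Rdf {
    public static void main(String[] args) throws Exception {
        // Compile the stylesheet once, then transform each XML file to RDF
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("xml2rdf3.xsl")));
        for (File f : new File(".").listFiles((dir, name) -> name.endsWith(".xml"))) {
            File out = new File(f.getName().replaceAll("\\.xml$", ".rdf"));
            t.transform(new StreamSource(f), new StreamResult(out));
        }
    }
}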

Now we simply create a new folder on our apache2 server and copy the result there

cd /var/www/lexinfo.net/htdocs/
mkdir -p corpora/alpino
cp -r ~/AlpinoCDROM corpora/alpino
chown -R apache:apache corpora/

And now we see that the data is available.

In fact, a linked data server is just a normal web server that returns RDF data. We make a quick modification to the MIME types, in /etc/apache2/modules.d/00_mod_mime.conf on my server (check your Linux distro’s documentation), to make sure it returns the correct type, and then restart the server.

AddType application/rdf+xml .rdf
AddType text/turtle .ttl

# For type maps (negotiated resources):
AddHandler type-map var

We can check this works very simply as follows

jmccrae@greententacle ~/AlpinoCDROM $ curl -I -H "Accept: application/rdf+xml" http://lexinfo.net/corpora/alpino/cgn_exs/1.rdf
HTTP/1.1 200 OK
Date: Thu, 09 Aug 2012 19:13:58 GMT
Server: Apache
Last-Modified: Thu, 09 Aug 2012 19:12:38 GMT
ETag: "1c96013-6e9-4c6da02bdb580"
Accept-Ranges: bytes
Content-Length: 1769
Cache-Control: max-age=1209600
Expires: Thu, 23 Aug 2012 19:13:58 GMT
Content-Type: application/rdf+xml

So far, so good… the next step is to enable content negotiation. For Alpino we have the issue that the raw XML files are named without an extension, so we rename all of these files to have the extension .txt. Then in each folder we create a file called .htaccess and add the following line to it.

Options +MultiViews

Now we test it:

jmccrae@greententacle ~ $ curl -I -H "Accept: application/rdf+xml" http://lexinfo.net/corpora/alpino/cgn_exs/1
HTTP/1.1 200 OK
Date: Thu, 09 Aug 2012 19:38:08 GMT
Server: Apache
Content-Location: 1.rdf
Vary: negotiate,accept
TCN: choice
Last-Modified: Thu, 09 Aug 2012 19:12:38 GMT
ETag: "1c96013-6e9-4c6da02bdb580;4c6da48d60b80"
Accept-Ranges: bytes
Content-Length: 1769
Cache-Control: max-age=1209600
Expires: Thu, 23 Aug 2012 19:38:08 GMT
Content-Type: application/rdf+xml

It works… Now to link it to something. Inspecting the data, there are three clear groups of categories in the corpus: “cat” for categories/phrase types, “rel” for dependency relations and “pos” for part-of-speech tags. Many of these can be aligned to a data category registry or linguistic ontology. I chose to provide alignments to ISOcat and to LexInfo. This was done by creating an OWL ontology describing the categories used in the resource; for example, the following describes “adverbs” in ALPINO

<owl:NamedIndividual rdf:about="http://lexinfo.net/corpora/alpino/categories#adv">
   <rdf:type rdf:resource="http://lexinfo.net/corpora/alpino/categories#PartOfSpeech"/>
   <rdfs:label xml:lang="en">Adverb</rdfs:label>
   <dcr:datcat rdf:resource="http://www.isocat.org/datcat/DC-1232"/>
   <owl:sameAs rdf:resource="&lexinfo;adverb"/>
</owl:NamedIndividual>

Finally we modify the XSLT to use these new categories. In particular we modify the script at line 105 so that it generates a triple with a URI object for these attributes; the three <xsl:when> branches below are the new code, while the <xsl:otherwise> branch keeps the original behaviour

<xsl:choose>
  <xsl:when test="name()='rel'">
    <cat:rel>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:rel>
  </xsl:when>
  <xsl:when test="name()='cat'">
    <cat:cat>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:cat>
  </xsl:when>
  <xsl:when test="name()='pos'">
    <cat:pos>
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="concat('http://lexinfo.net/corpora/alpino/categories#',.)"/>
      </xsl:attribute>
    </cat:pos>
  </xsl:when>
  <xsl:otherwise>
    <xsl:element name="{name()}" namespace="{$ns}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:otherwise>
</xsl:choose>

We apply this, publish the result along with our ontology, and so have the first version of our linked data corpus.

Finally, I make the resource browsable by creating a zipped dump of all the data and a new index page.

In Part 2, we set up a SPARQL endpoint and register the resource with CKAN.

Ontology File: http://lexinfo.net/corpora/alpino/categories.rdf

XSLT File: http://lexinfo.net/corpora/alpino/alpino_xml2rdf3.xsl