A simple regular expression for tokenization of (most) natural language

I often need to tokenize text and have generally relied on the following fairly simple regular expression to do the trick

string.split("\\s+")

\\s is the character class for (ASCII) whitespace, so this works at first glance; in practice it quickly leads to problems. Let’s take an example bit of text:

So in “this test”, we wish to check tokenization; among other things. So…, we ask a question? And make a statement! (and maybe a note). I’ll check some other stuff, like we may have    exaggerated    spacing! Or strange quotes, like «en français» or „auf Deutsch“.
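Running the naive whitespace split on just the first clause makes the problem concrete (a minimal sketch; the class and method names are mine):

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceSplitDemo {
    // Naive tokenization: split on runs of (ASCII) whitespace.
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.split("\\s+"));
    }

    public static void main(String[] args) {
        // \u201C and \u201D are the curly double quotes from the sample text.
        String sample = "So in \u201Cthis test\u201D, we wish to check tokenization";
        // The quotes (and the comma) stay glued to the words:
        // tokens include \u201Cthis and test\u201D, rather than clean words.
        System.out.println(tokenize(sample));
    }
}
```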

So we quickly have the issue that we get tokens like “this and test”, with the quotation marks still attached, which are not so good… instead we would like to have each quotation mark as a token by itself. Now there is another regular expression we could use:

string.split("\\b")
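A quick check of what the word-boundary split actually produces (sketch; note that on Java 8+ the zero-width match at the start of the string does not produce a leading empty token):

```java
import java.util.Arrays;
import java.util.List;

public class BoundarySplitDemo {
    // Split at every word boundary (\b is zero-width, so nothing is consumed).
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.split("\\b"));
    }

    public static void main(String[] args) {
        // The contraction splits at the boundaries around the apostrophe,
        // and the space between the words becomes a token of its own.
        System.out.println(tokenize("doesn't work"));
    }
}
```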

However, this creates its own issues… firstly, now each space is its own token… and compound punctuation doesn’t work, “like in quotes.” Even worse, contractions like “doesn’t” get split into three tokens. Not great… I present my solution:

string.replaceAll("(\\.\\.\\.+|[\\p{Po}\\p{Ps}\\p{Pe}\\p{Pi}\\p{Pf}\u2013\u2014\u2015&&[^'\\.]]|(?<!(\\.|\\.\\p{L}))\\.(?=[\\p{Z}\\p{Pf}\\p{Pe}]|\\Z)|(?<!\\p{L})'(?!\\p{L}))"," $1 ")
  .replaceAll("\\p{C}|^\\p{Z}+|\\p{Z}+$","")
  .split("\\p{Z}+")

Daunting, but I will attempt to explain it… the first line is where most of the magic happens… it is a regex group of alternatives, consisting of the following

  1. \\.\\.\\.+ : Captures any ellipsis of three or more full stops
  2. [\\p{Po}\\p{Ps}\\p{Pe}\\p{Pi}\\p{Pf}\u2013\u2014\u2015&&[^'\\.]]: Captures most single punctuation marks. We use mostly Unicode categories, in particular all “other” punctuation, start and end punctuation (brackets, braces, etc.), initial and final quotes, and long dashes. Finally, the class contains two unwanted elements, “.” and “'”, which are excluded by the character-class intersection (&&).
  3. (?<!(\\.|\\.\\p{L}))\\.(?=[\\p{Z}\\p{Pf}\\p{Pe}]|\\Z): This is for full stops; they are kind of hard, as we would like to avoid splitting “I.B.M.” and of course ellipses. First we use a zero-width look-behind assertion to check that the full stop is not preceded by another full stop, nor by a full stop followed by a letter (as in “I.B.M.”). Then we look ahead and check that the next character is a space, an end punctuation, a final quote, or the end of the string (that is \\Z)
  4. (?<!\\p{L})'(?!\\p{L}): This finally matches all single quotes that aren’t between two letters… ’tis not always correct, but…
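The full-stop alternative (item 3) is the subtlest, so here is a quick check of just that pattern, copied verbatim from the expression above, against the two tricky cases (the helper method is my own):

```java
import java.util.regex.Pattern;

public class FullStopDemo {
    // A full stop not preceded by "." or ". + letter", and followed by a
    // space, closing punctuation, a final quote, or the end of input.
    static final Pattern FULL_STOP =
        Pattern.compile("(?<!(\\.|\\.\\p{L}))\\.(?=[\\p{Z}\\p{Pf}\\p{Pe}]|\\Z)");

    public static boolean hasSentenceFinalDot(String text) {
        return FULL_STOP.matcher(text).find();
    }

    public static void main(String[] args) {
        // true: an ordinary sentence-final full stop
        System.out.println(hasSentenceFinalDot("Stop."));
        // false: every dot is part of the abbreviation, so none is split off
        System.out.println(hasSentenceFinalDot("I.B.M."));
    }
}
```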

The replacement string is then simply whatever matched, with a space on either side. This generates some extra spaces, of course.

The next pass is quite simple… we remove all leading and trailing spaces, as well as any control characters… it is important to use the Unicode category \\p{Z} here, as \\s does not match the non-breaking space, which occurs quite often in some corpora (e.g., Wikipedia)… and allows you to draw triforces, of course.
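To see why the Unicode category matters, a small check with a non-breaking space (U+00A0) between two words:

```java
public class NbspDemo {
    public static void main(String[] args) {
        String s = "deux\u00A0mots"; // two words joined by a non-breaking space

        // \s is ASCII-only by default in Java, so it misses U+00A0: no split.
        System.out.println(s.split("\\s+").length);   // 1

        // \p{Z} is the Unicode "separator" category, which includes U+00A0.
        System.out.println(s.split("\\p{Z}+").length); // 2
    }
}
```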

Finally, we split the text on the Unicode spaces that are now in the text. This also eliminates all the extra spaces we created in the first step.
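Putting the three passes together as a single helper (the wrapper class and method are my own sketch; the regular expressions are exactly the ones above):

```java
import java.util.Arrays;
import java.util.List;

public class SimpleTokenizer {
    public static List<String> tokenize(String text) {
        String spaced = text
            // Pass 1: pad ellipses, punctuation, sentence-final full stops
            // and non-contraction apostrophes with spaces on either side.
            .replaceAll("(\\.\\.\\.+|[\\p{Po}\\p{Ps}\\p{Pe}\\p{Pi}\\p{Pf}\u2013\u2014\u2015&&[^'\\.]]|(?<!(\\.|\\.\\p{L}))\\.(?=[\\p{Z}\\p{Pf}\\p{Pe}]|\\Z)|(?<!\\p{L})'(?!\\p{L}))", " $1 ")
            // Pass 2: drop control characters and leading/trailing spaces.
            .replaceAll("\\p{C}|^\\p{Z}+|\\p{Z}+$", "");
        // Pass 3: split on runs of Unicode spaces.
        return Arrays.asList(spaced.split("\\p{Z}+"));
    }

    public static void main(String[] args) {
        // "I.B.M." and "doesn't" survive intact; quotes, the comma and the
        // final full stop become tokens of their own.
        System.out.println(
            tokenize("I.B.M. doesn't split, but \u201Cthis test\u201D does."));
    }
}
```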

This tokenization is a little different from some of the more widely known schemes, such as the Penn Treebank method or Lucene’s; however, it does not change the original text other than white-space, and it is very easy to use. It also has no special rules for English and should work well on most languages (with some obvious exceptions, such as Chinese, Japanese and Korean, which do not delimit words with spaces).
