Word segmentation is a fundamental problem in natural language processing. This post presents an efficient approach to segmenting Vietnamese text with good accuracy. We first design regular expressions to extract a wide variety of token types from a text. We then introduce the two main types of ambiguity in Vietnamese word segmentation: overlap ambiguity and combination ambiguity.

What is Word Segmentation?

Word segmentation is a fundamental problem in natural language processing. It is the problem of segmenting written text into individual words or tokens. Roughly speaking, tokens are the smallest meaningful lexical units that a human reader can understand. When reading, we humans segment text into tokens through mental processes. But how can we implement these processes in a computer program, as a first step toward helping a computer understand a text?

Some written languages have an explicit word boundary marker, such as the space character in written English and other Western languages. For these languages, word segmentation is a comparatively easy problem. For example, given an English sentence:

  • On the evening of March 31st, Elon Musk unveiled Tesla’s sinuous Model 3, the company’s first affordable electric-car model.

This sentence can easily be split into tokens using the space character, punctuation, and other delimiter characters.

  • [On] [the] [evening] [of] [March] [31st] [,] [Elon] [Musk] [unveiled] [Tesla]… [electric-car] [model] [.]
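As a minimal sketch of this kind of delimiter-based tokenization, the following Java snippet treats a run of letters and digits (optionally joined by hyphens) as one token and every other non-space character as a token of its own. The class name SimpleTokenizer and the pattern itself are illustrative assumptions for this post, not a standard tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleTokenizer {
    // Either a word made of letters/digits, possibly hyphen-joined (e.g. electric-car),
    // or any single non-space character (e.g. a comma or a full stop).
    private static final Pattern TOKEN =
        Pattern.compile("[\\p{L}\\p{N}]+(?:-[\\p{L}\\p{N}]+)*|\\S");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Elon Musk unveiled the electric-car model.")) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Note that this sketch splits an apostrophe off as a separate token, which may or may not be what a given application wants.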

However, for many languages such as Vietnamese, the space is not always a word separator and word segmentation is thus a non-trivial problem. As an example, the following sentence

  • Pep Guardiola sẽ có cơ hội phục hận cho đội bóng ông từng gắn bó nhiều năm.

should be segmented as follows

  • [Pep Guardiola] [sẽ] [có] [cơ hội] [phục hận] [cho] [đội] [bóng] [ông] [từng] [gắn bó] [nhiều] [năm] [.]

As seen, in Vietnamese text, spaces separate syllables, not words, and a compound word can span multiple syllables. It is worth noting that while most syllables are words by themselves, many syllables cannot stand alone as a word. For example, “thạc sĩ” is a word, but “thạc” is not a word in the Vietnamese dictionary.

Why is Vietnamese Word Segmentation Hard?

As shown in the previous section, Vietnamese word segmentation is not a trivial problem. Vietnamese computer scientists have tackled it and developed effective algorithms that solve it with good accuracy.

The major difficulty of Vietnamese word segmentation is the ambiguity of the space character. The main question is: how do we determine automatically which spaces are word separators and which are syllable separators? In a simple phrase of three consecutive syllables “a b c”, there are four possible segmentations: “[a] [b] [c]”, “[a b] [c]”, “[a] [b c]”, and “[a b c]”. For a longer phrase, the number of possible segmentations grows exponentially: a phrase of n syllables has n - 1 spaces, each of which is either a word boundary or not, which gives 2^(n-1) segmentations.
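To make the count concrete, here is a small Java sketch that enumerates every segmentation of a short phrase by treating each space as a binary choice. The class name SegmentationEnumerator is made up for illustration, and brute-force enumeration like this is only meant to show the size of the search space, not to serve as a practical segmenter.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentationEnumerator {
    // Intended for short phrases only (fewer than 32 syllables), since it uses an int bit mask.
    static List<List<String>> enumerate(String[] syllables) {
        List<List<String>> results = new ArrayList<>();
        int n = syllables.length;
        if (n == 0) {
            return results;
        }
        // Bit (i - 1) of the mask decides whether the space before syllable i is a word boundary.
        for (int mask = 0; mask < (1 << (n - 1)); mask++) {
            List<String> words = new ArrayList<>();
            StringBuilder current = new StringBuilder(syllables[0]);
            for (int i = 1; i < n; i++) {
                if ((mask & (1 << (i - 1))) != 0) {
                    words.add(current.toString());          // close the current word
                    current = new StringBuilder(syllables[i]);
                } else {
                    current.append(' ').append(syllables[i]); // extend the current word
                }
            }
            words.add(current.toString());
            results.add(words);
        }
        return results;
    }

    public static void main(String[] args) {
        for (List<String> segmentation : enumerate("a b c".split(" "))) {
            System.out.println(segmentation);
        }
    }
}
```

For “a b c” this produces the four segmentations listed above; for a phrase of ten syllables it would already produce 512.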

A careful analysis of ambiguous cases reveals that there are two main types of ambiguity when segmenting phrases:

1.      Overlap ambiguity: given a phrase of three syllables “a b c”, both “[a] [b c]” and “[a b] [c]” are valid segmentations. For example, “thuộc địa bàn” has two plausible segmentations, “[thuộc địa] [bàn]” or “[thuộc] [địa bàn]”; similarly, “tổ hợp âm” can be divided as “[tổ hợp] [âm]” or “[tổ] [hợp âm]”. Deciding which segmentation is better is not obvious.

2.      Combination ambiguity: given a phrase of two syllables “a b”, both “[a b]” and “[a] [b]” are valid segmentations, and the right choice cannot be made without knowing the meaning of the phrase in context. For example, the two syllables “chanh chua” should be treated as one word “[chanh chua]” in the sentence “Cô gái chanh chua”. However, they should be separated into two words “[chanh] [chua]” in the sentence “Cô gái ăn quả chanh chua”.
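Overlap ambiguities can at least be detected mechanically once a lexicon is available. The sketch below flags every position where two overlapping two-syllable candidates are both dictionary words. The class name OverlapAmbiguityFinder and the two-entry lexicon are toy assumptions for illustration; a real system would use a full dictionary, and detecting an ambiguity says nothing yet about how to resolve it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class OverlapAmbiguityFinder {
    // Returns every index i such that both "s[i] s[i+1]" and "s[i+1] s[i+2]" are in the lexicon.
    static List<Integer> findOverlaps(String[] syllables, Set<String> lexicon) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i + 2 < syllables.length; i++) {
            String left = syllables[i] + " " + syllables[i + 1];
            String right = syllables[i + 1] + " " + syllables[i + 2];
            if (lexicon.contains(left) && lexicon.contains(right)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // Toy lexicon (an assumption for this example, not a real dictionary).
        Set<String> lexicon = Set.of("thuộc địa", "địa bàn");
        System.out.println(findOverlaps("thuộc địa bàn".split(" "), lexicon));
    }
}
```

Combination ambiguity is harder to handle this way: a lexicon can tell us that both “[a b]” and “[a] [b]” are possible, but choosing between them requires context.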

In addition, recognizing named entities in written text is ambiguous in many cases. A named entity is usually signaled by a capital letter at the beginning of each of its syllables, as in “thủ tướng Nguyễn Tấn Dũng”, so we can derive a rule that captures this regularity and extracts “Nguyễn Tấn Dũng” as a token. However, what if a sentence opens with “Ông Nguyễn Tấn Dũng”, “Theo Đài Hà Nội”, or “Theo Walcott đã ghi bàn thắng”? The first syllable of a sentence is capitalized whether or not it belongs to the name.

Regular Expressions

Regular expressions are a powerful pattern language for defining and matching a wide range of text patterns. They are mainly used for pattern matching on strings. Their theory is studied in formal language theory and theoretical computer science.

Many programming languages provide regular expression (regexp) capabilities, some built in (e.g., Perl or JavaScript), some via a standard library (e.g., Java, Python, or C++). In the following, we describe some basic regexp constructs, with examples in the Java programming language.

First, some basic character classes are:

[abc]     a, b, or c
[^abc]    any character except a, b, or c
[a-z]     a through z, inclusive
[A-Z]     A through Z, inclusive
.         any character (in Java, line terminators are excluded by default)
\d        a digit: [0-9]
\D        a non-digit: [^\d]
\s        a whitespace character, including a new-line
\S        a non-whitespace character
\w        a word character: [a-zA-Z_0-9]
\W        a non-word character: [^\w]
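A quick way to get a feel for these classes in Java is to test single characters against them. The sketch below is illustrative only (the class and method names are made up for this post). It also shows a point that matters for Vietnamese: by default Java's \w covers only ASCII letters, digits, and the underscore, so a letter such as ễ matches only when the Pattern.UNICODE_CHARACTER_CLASS flag is set.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Same test for \w, but with Unicode-aware predefined classes switched on.
    static boolean isUnicodeWordChar(String input) {
        return Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(input)
                      .matches();
    }

    public static void main(String[] args) {
        String e = "\u1EC5"; // the Vietnamese letter ễ as a single precomposed character
        System.out.println(matches("[abc]", "b"));
        System.out.println(matches("\\d", "7"));
        System.out.println(matches("\\s", "\n"));
        System.out.println(matches("\\w", e));      // ASCII-only \w
        System.out.println(isUnicodeWordChar(e));   // Unicode-aware \w
    }
}
```

Note that in Java source code each backslash of a regexp must itself be doubled inside a string literal, which is why \d is written as "\\d".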

Common greedy quantifiers are:

X? X, once or not at all
X* X, zero or more times
X+ X, one or more times
X{n} X, exactly n times
X{n,} X, at least n times
X{n,m} X, at least n but not more than m times
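The word “greedy” means that a quantifier takes as many repetitions as it can while still allowing the overall match to succeed. The following Java sketch illustrates this by returning the first match of a pattern in a string; the class name QuantifierDemo and its helper are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // \d{2,4} could stop after two digits, but being greedy it takes four.
        System.out.println(firstMatch("\\d{2,4}", "123456"));
        // u? makes the letter u optional.
        System.out.println(firstMatch("colou?r", "color"));
        // \d+ needs at least one digit.
        System.out.println(firstMatch("\\d+", "year 2016"));
    }
}
```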

Note that characters such as ‘.’, ‘-’, ‘?’, ‘*’, ‘+’, ‘[’, and ‘]’ are called meta-characters because they have special meanings in patterns. To match a meta-character literally, we need to “escape” it with a preceding backslash: for example, \. for the dot character, \* for the star character, etc.
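The effect of escaping is easy to check in Java. In the sketch below (the class name EscapeDemo is just an illustrative choice), an unescaped dot accepts any character, while an escaped dot accepts only a literal dot. When a whole string should be taken literally, the standard method Pattern.quote does the escaping for us.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("1.5", "1x5"));    // unescaped dot matches any character
        System.out.println(matches("1\\.5", "1x5"));  // escaped dot matches only '.'
        System.out.println(matches("1\\.5", "1.5"));
        // Pattern.quote turns an arbitrary string into a regexp that matches it literally.
        System.out.println(matches(Pattern.quote("a+b"), "a+b"));
    }
}
```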

By combining the basic regexp constructs above, we can now define non-trivial text patterns. For example:

Regexp:      \d{4}
Description: a year of 4 digits
Examples:    2016, 1890

Regexp:      \d{4}-\d{4}
Description: a duration with a start year and an end year
Examples:    1890-1969

Regexp:      (0*[1-9]|1[012])[\-/\.]\d{4}
Description: a date of format mm-yyyy, mm/yyyy, or mm.yyyy
Examples:    10-1980, 07/2012, 9/2015

Regexp:      [\+\-]?([0-9]*)?[0-9]+([\.,]\d+)*
Description: any number, positive or negative, integer or real, in English or Vietnamese format
Examples:    22.30, -22,30, +34,567,89
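These patterns can be checked against their examples in Java. The sketch below writes them as Java string literals, with every backslash doubled, and tests whether a whole string matches. The class name TablePatterns is hypothetical, and the duration pattern is written with a plain hyphen between the two years, which is an assumption about the intended form.

```java
import java.util.regex.Pattern;

class TablePatterns {
    static final Pattern YEAR = Pattern.compile("\\d{4}");
    // Assumption: a plain hyphen separates the start year and the end year.
    static final Pattern DURATION = Pattern.compile("\\d{4}-\\d{4}");
    static final Pattern MONTH_YEAR = Pattern.compile("(0*[1-9]|1[012])[\\-/\\.]\\d{4}");
    static final Pattern NUMBER = Pattern.compile("[\\+\\-]?([0-9]*)?[0-9]+([\\.,]\\d+)*");

    // True if the pattern matches the entire string, not just a part of it.
    static boolean isWhole(Pattern pattern, String s) {
        return pattern.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWhole(YEAR, "2016"));
        System.out.println(isWhole(DURATION, "1890-1969"));
        System.out.println(isWhole(MONTH_YEAR, "07/2012"));
        System.out.println(isWhole(MONTH_YEAR, "13-1980")); // 13 is not a valid month
        System.out.println(isWhole(NUMBER, "+34,567,89"));
    }
}
```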

We can also extract named entities such as Elon Musk or Pep Guardiola from the example sentences by designing an appropriate regular expression. A named entity consists of one or more syllables whose first letters are capitalized, and the syllables are separated by space characters. We can thus specify the following regexp to match English named entities: ([A-Z][a-z]*)([\s+\-][A-Z][a-z]+)*. Note that inside the brackets of [\s+\-] the + is a literal plus sign, not a quantifier, so this class matches a single whitespace character, plus sign, or hyphen. For Vietnamese named entities, it is a little more complicated because we have to deal with Unicode characters. To match a named entity such as Nguyễn Tấn Dũng, we can specify the following regexp: ([\p{Lu}][\p{L}&&[^\p{Lu}]]*)([\s+\-][\p{Lu}][\p{L}&&[^\p{Lu}]]+)*. Here \p{Lu} matches any uppercase Unicode letter, and [\p{L}&&[^\p{Lu}]] uses Java's character-class intersection to match any letter that is not uppercase. The curious reader may wish to spend some time verifying that the expression is correct.
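One way to carry out that verification is to run both expressions in Java. The sketch below collects every match of a pattern in a text; the class name NamedEntityExtractor is hypothetical, and the code assumes the input uses precomposed (NFC) characters, since decomposed diacritics are combining marks rather than letters and would interrupt a match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedEntityExtractor {
    static final Pattern ENGLISH =
        Pattern.compile("([A-Z][a-z]*)([\\s+\\-][A-Z][a-z]+)*");
    static final Pattern VIETNAMESE = Pattern.compile(
        "([\\p{Lu}][\\p{L}&&[^\\p{Lu}]]*)([\\s+\\-][\\p{Lu}][\\p{L}&&[^\\p{Lu}]]+)*");

    // Returns all non-overlapping matches of the pattern, from left to right.
    static List<String> extract(Pattern pattern, String text) {
        List<String> entities = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            entities.add(m.group());
        }
        return entities;
    }

    public static void main(String[] args) {
        System.out.println(extract(ENGLISH, "Elon Musk unveiled Tesla"));
        System.out.println(extract(VIETNAMESE, "thủ tướng Nguyễn Tấn Dũng"));
        // A sentence-initial capital is absorbed into the match, name or not.
        System.out.println(extract(VIETNAMESE, "Theo Đài Hà Nội"));
    }
}
```

The last call shows the sentence-initial ambiguity discussed earlier: the pattern cannot tell whether the capitalized first syllable belongs to the name, so it matches the whole capitalized run.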

In a similar way, we can build regular expressions to match and extract a wide variety of token types in a Vietnamese or English text, such as all-caps words or abbreviations (e.g., TP HCM, ĐHQG), emails, web links, dates, and times. This is the first essential step of Vietnamese word segmentation. In the next post, we shall describe how phrases can be segmented, including how segmentation ambiguities can be handled efficiently.
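One possible way to organize several token patterns is to keep them in a prioritized list and report the first type that matches a candidate string. The sketch below is only an illustration of that idea: the class name TokenTypeClassifier, the type labels, and the deliberately simplified email and all-caps patterns are all assumptions made for this example, not the patterns of an actual segmenter.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class TokenTypeClassifier {
    // Insertion order doubles as priority: the first pattern that matches wins.
    private static final Map<String, Pattern> TYPES = new LinkedHashMap<>();
    static {
        // Simplified, illustrative patterns (assumptions, not production-grade).
        TYPES.put("email", Pattern.compile("[\\w.+\\-]+@[\\w\\-]+(\\.[\\w\\-]+)+"));
        TYPES.put("date", Pattern.compile("(0*[1-9]|1[012])[\\-/\\.]\\d{4}"));
        TYPES.put("year", Pattern.compile("\\d{4}"));
        TYPES.put("allcaps", Pattern.compile("\\p{Lu}{2,}"));
    }

    // Returns the label of the first pattern matching the whole candidate, or "unknown".
    static String classify(String candidate) {
        for (Map.Entry<String, Pattern> entry : TYPES.entrySet()) {
            if (entry.getValue().matcher(candidate).matches()) {
                return entry.getKey();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"07/2012", "2016", "ĐHQG", "a.b@example.com", "xyz"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```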

Le Hong Phuong – FHO

