How does a child normally learn to read in its native language? First it gets to know all the graphical symbols (letters, punctuation marks, accent marks, and so on) that are used to write that language. Next it learns the relationships between symbols and sounds, and learns to connect the sounds into words and phrases while reading. After some time the child is able to interpret even whole sentences at a single glance.
How can we teach a computer to read text? Can it learn like a human does, or does it require a different method? These are the questions we will be considering in this post.
Assume that our computer “pupil” already knows the alphabet of a certain language. How can we give it the ability to identify words? This task might seem trivial – you just need to group together all the letters that stand between one space and the next! However, as we will soon see, this simple recipe doesn’t always work.
Consider the following text:
No one in Las Vegas spends less than $10 000.
The logical way of dividing up this sentence into units would be this:
No one | in | Las Vegas | spends | less | than | $ | 10 000 | .
This is because the units “No one”, “Las Vegas” and “10 000”, although they contain spaces, constitute indivisible semantic items. However, if we divide the sentence according to the “space to space” rule, we get something quite different:

No | one | in | Las | Vegas | spends | less | than | $10 | 000.
The above logical division of the text into units does not entirely correspond to the traditional division into words. In computer processing, in fact, the concept of a token is used. This refers to a sequence of consecutive characters, not necessarily alphabetical. In particular, a token may be a word, a number, a date, a telephone number, or a Web address.
The human activity of dividing a text into words corresponds to the computer term tokenisation, which means dividing a text into tokens.
Full stops (periods) cause significant difficulties when attempting to divide text into tokens. Consider this sentence:
Write to Dr. Kowalski at kowalski@poleng.pl.
The desired division of this text into tokens looks like this:
Write | to | Dr. | Kowalski | at | kowalski@poleng.pl | .

However, in order to “tokenise” the sentence in this way, the computer program must be able to distinguish the periods that appear in abbreviations or e-mail addresses from those that mark the end of a sentence. Similar kinds of problems are caused by certain other signs, including hyphens, dashes, apostrophes, and emoticons.

How do computers cope with dividing text?
There are two basic approaches to the task of tokenisation. In the case of dictionary- and rule-based methods, lexicons are created to list specific words, such as abbreviations ending with a period. In addition, patterns are defined for selected tokens (such as dates, Web addresses, etc.) – usually in the form of regular expressions – to enable such items to be identified in a text.
In methods that utilise machine learning, training data is supplied in the form of examples of correctly tokenised texts. The system is expected to learn to divide new texts solely on the basis of that data.
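As an illustration of the first approach, here is a minimal rule-based tokeniser in Python; the patterns and the abbreviation list are simplified examples of my own, not a production-ready solution:

```python
import re

# Patterns for "special" tokens (e-mail addresses, abbreviations, numbers)
# are tried before the generic word pattern; the last alternative catches
# any remaining single character, such as "$" or a sentence-final ".".
TOKEN_PATTERN = re.compile(r"""
      [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # e-mail address
    | (?:Dr|Mr|Mrs|e\.g)\.           # a few abbreviations ending with a period
    | \d+(?:[ ,.]\d+)*               # numbers such as 10 000 or 3.14
    | \w+                            # ordinary words
    | [^\w\s]                        # any other single non-space character
""", re.VERBOSE)

def tokenise(text):
    return TOKEN_PATTERN.findall(text)

print(" | ".join(tokenise("Write to Dr. Kowalski at kowalski@poleng.pl.")))
# Write | to | Dr. | Kowalski | at | kowalski@poleng.pl | .
```

Multi-word units such as “Las Vegas” or “10 000” would additionally require a lexicon of such expressions – which the dictionary part of this approach can provide.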
There are some languages that do not use separators between words. One of these is the official language of the People’s Republic of China, Mandarin Chinese.
To divide a text in this language into tokens, one may apply what is known as the MaxMatch algorithm.
The MaxMatch algorithm assumes the existence of a lexicon containing all of the words of a given language. Starting from the first character of the text we want to analyse, we identify the longest sequence of characters that appears in that lexicon, and then split it off from the rest of the text. (If no word from the lexicon is found, we split off the first character of the text.) This procedure is repeated for the remainder of the text.

Let’s see how this algorithm would work for an example English sentence, if words in that language were not separated by spaces. Consider the sentence:

Addtheyeasttothedough.
We would like to see it divided thus:
Add | the | yeast | to | the | dough | .
No luck, however – the MaxMatch algorithm will return the following division:
Add | they | east | tot | he | dough | .
This shows that the spaces we use to separate words in our language are a real godsend!
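For the curious, here is a minimal sketch of MaxMatch in Python; the toy lexicon is my own and deliberately tiny, whereas a real system would use a full dictionary of the language:

```python
# Toy lexicon for the example; a real MaxMatch lexicon would contain
# all the words of the language.
LEXICON = {"add", "the", "they", "east", "yeast", "to", "tot", "he", "dough"}

def max_match(text, lexicon=LEXICON):
    """Greedily split `text` into the longest words found in `lexicon`."""
    max_len = max(len(word) for word in lexicon)
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest possible candidate first, then shorten it.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end].lower() in lexicon:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No lexicon word starts here: split off a single character.
            tokens.append(text[pos])
            pos += 1
    return tokens

print(" | ".join(max_match("Addtheyeasttothedough.")))
# Add | they | east | tot | he | dough | .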
It would appear that a sentence is the basic fragment of text having a consistent interpretation. We read sentence by sentence; we speak sentence by sentence; we translate texts sentence by sentence. A computer tries to imitate us in this regard by attempting to divide a text into sentences. But an apparently simple rule, like “Divide wherever a full stop is followed by a space and a capital letter”, can sometimes let us down:
Undivided text:
In 2019 Mr. Brown visited many different countries, e.g. Germany, France and Canada. He simply loves to travel!
Divided text:
In 2019 Mr. | Brown visited many different countries, e.g. | Germany, France and Canada. | He simply loves to travel!
As you will notice, there are rather too many sentence divisions here.
As in the case of dividing a text into tokens, there are two basic approaches to dividing a text into sentences: dictionary- and rule-based methods, and methods based on machine learning.
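A naive rule-based splitter might, for instance, combine the boundary rule with a hand-made list of abbreviations. The sketch below is illustrative only (the abbreviation list and the function are my own):

```python
import re

# Abbreviations after which a full stop should NOT end a sentence.
ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "e.g", "i.e"}

def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: ".", "!" or "?" followed by a space and a capital letter.
    for match in re.finditer(r"[.!?] +(?=[A-Z])", text):
        before = text[start:match.start()].split()
        last_word = before[-1] if before else ""
        # Skip the boundary if the text so far ends with a known abbreviation.
        if last_word.rstrip(".").lower() in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

text = ("In 2019 Mr. Brown visited many different countries, "
        "e.g. Germany, France and Canada. He simply loves to travel!")
for sentence in split_sentences(text):
    print(sentence)
# In 2019 Mr. Brown visited many different countries, e.g. Germany, France and Canada.
# He simply loves to travel!
```

Machine-learning splitters dispense with such hand-written lists and instead learn from correctly divided training texts, just as in the case of tokenisation.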
When neural networks began to play an important role in natural language processing (the “neural breakthrough” in machine translation came in 2014), it turned out that the traditional way of dividing text into tokens was unsuitable. The input layer of a neural network has to be a vector whose dimension equals the number of distinct words used in the documents being processed – in extreme cases this can run into the millions. Unfortunately, computations in such enormous networks are still beyond the reach of the machines being built today.
A way therefore had to be found to compress the dictionary – that is, to reduce the number of entries it contains. One such method is the BPE (byte pair encoding) algorithm. The input data to this algorithm consists of the text (or set of texts) that is to be processed, along with the target size of the dictionary, which in practical scenarios should not be much greater than 50,000. The items appearing in such a dictionary are known as subwords.
The BPE algorithm is based on the merging of the most frequent bigrams, where a bigram is a pair of neighbouring fragments of text.
Let us analyse the action of the BPE algorithm when it receives as input a target dictionary size of 14 and the following text:
Hurry on, Harry, hurry on!
The initial vocabulary consists of single characters:
{<w>, h, u, r, y, o, n, a}
The size of the vocabulary is 8.
In the first step, punctuation marks are removed from the text, the letters are lowercased, and the text is divided into single characters. Word boundaries (denoted here as <w>) are treated just like letters:
<w> h u r r y <w> o n <w> h a r r y <w> h u r r y <w> o n <w>
The numbers of occurrences of individual bigrams in the given text are as follows:
<w> h | 3 |
h u | 2 |
u r | 2 |
r r | 3 |
r y | 3 |
y <w> | 3 |
<w> o | 2 |
o n | 2 |
n <w> | 2 |
h a | 1 |
a r | 1 |
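These counts are easy to reproduce in code – a small illustration in Python (the full merging loop is sketched at the end of this walkthrough):

```python
from collections import Counter

# Count neighbouring pairs (bigrams) in the pre-processed text.
tokens = "<w> h u r r y <w> o n <w> h a r r y <w> h u r r y <w> o n <w>".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
for (left, right), count in bigram_counts.most_common():
    print(left, right, count)   # e.g. "<w> h 3", "r r 3", "r y 3", ...
```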
We begin by merging the first bigram that has three occurrences: <w> h.
<w>h u r r y <w> o n <w>h a r r y <w>h u r r y <w> o n <w>
The current vocabulary now consists of 9 subwords:
{<w>, h, u, r, y, o, n, a, <w>h}
This has eliminated the bigrams h u and h a, since in all cases the letter h has been joined to the <w> sign. Similarly, single instances of the y <w> and n <w> bigrams have been lost.
The next bigram with three occurrences is r r. We now merge this into a single unit:
<w>h u rr y <w> o n <w>h a rr y <w>h u rr y <w> o n <w>
The current vocabulary now consists of 10 subwords:
{<w>, h, u, r, y, o, n, a, <w>h, rr}
This merging leads to the disappearance of the bigrams u r, r y and a r, but we can still merge rr and y:

<w>h u rry <w> o n <w>h a rry <w>h u rry <w> o n <w>
The current vocabulary now consists of 11 subwords:
{<w>, h, u, r, y, o, n, a, <w>h, rr, rry}
There are no more bigrams with three occurrences, so we go on to those with two. The first of these (in order of appearance) is <w>h u.
<w>hu rry <w> o n <w>h a rry <w>hu rry <w> o n <w>
The current vocabulary now consists of 12 subwords:
{<w>, h, u, r, y, o, n, a, <w>h, rr, rry, <w>hu}
We next merge further bigrams having two occurrences: first <w>hu with rry,

<w>hurry <w> o n <w>h a rry <w>hurry <w> o n <w>

and then <w>hurry with <w>:

<w>hurry<w> o n <w>h a rry <w>hurry<w> o n <w>
In the last of these representations, the dictionary has been increased to 14 subwords, which was the number required:
{<w>, h, u, r, y, o, n, a, <w>h, rr, rry, <w>hu, <w>hurry, <w>hurry<w>}
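For completeness, here is a compact sketch of the merging loop that reproduces the steps above. It is a simplified illustration: real BPE implementations also store the list of merges so that new texts can later be segmented into the same subwords.

```python
from collections import Counter

def bpe(tokens, target_vocab_size):
    """Merge the most frequent bigram until the vocabulary reaches the target size."""
    vocab = set(tokens)
    while len(vocab) < target_vocab_size:
        # Count all neighbouring pairs in the current token sequence.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Most frequent bigram; ties are resolved by order of first appearance,
        # as in the walkthrough above.
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        vocab.add(merged)
        # Replace every occurrence of the chosen bigram with the merged unit.
        new_tokens, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                new_tokens.append(merged)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return tokens, vocab

text = "<w> h u r r y <w> o n <w> h a r r y <w> h u r r y <w> o n <w>"
tokens, vocab = bpe(text.split(), target_vocab_size=14)
print(vocab)   # the 14 subwords listed above, including <w>hurry<w>
```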
It is therefore time to finish… both the running of the algorithm, and this blog post.
Humans would like AI to process natural language in the same way that they do. The first systems for natural language analysis were thus built on rules for constructing sentences similar to those we remember from grammar lessons. Machine translation systems took their translation rules directly from foreign-language textbooks.
As it turned out, however, the real breakthrough in computer natural language processing came only when we allowed machines to come into their own – to learn independently on the basis of data that we supplied to them. Until recently, it seemed that dividing text into words, sentences and paragraphs would remain the domain of rule-based systems, operating based on analogies with actions performed by humans. Today, it seems that in this task, as in many others, machines will find their own way.