In automatic text classification, a computer system has the job of assigning texts to defined categories. In some applications the system itself decides how to define the categories (classes) of texts. For example, if someone wants to classify the latest news reports, but is not certain what topics they will relate to, they may tell the artificial intelligence (AI) system to divide the set of reports into a specified number of classes (10, say). The system will then group the reports in such a way that each category contains texts that use similar vocabulary.
Under the alternative scenario, it is the human who decides how to define the classes to which the texts are to be assigned. For example, in an e-mail classification task (see also: /en/blog/nlp-what-is-it-and-what-can-it-be-used-for) we may ask the AI system to classify each e-mail message according to its usefulness for the addressee, or to place it appropriately into the user’s spam folder (the SPAM category) or their inbox (the NON-SPAM category).
Similarly, in sentiment analysis (see: /en/blog/nlp-what-is-it-and-what-can-it-be-used-for) the classifier has to analyse reviews and place them into three predefined categories according to the writer’s attitude to the subject: negative, neutral or positive.
The basic technique for dividing texts into categories that have been predefined by a human is Bayes’ method. This is the classification method that I will be describing in this blog post.
Bayes’ method is based on supervised learning. This means that the AI system learns to perform a certain task by analysing a set of data for which the task has already been performed. In the case of spam recognition, the system learns the task with the help of e-mail messages that have already been classified appropriately by a human being.
Preparing data for the purpose of machine learning is a task that is quite time-consuming and intellectually uninteresting. It requires many repetitions of similar actions, the quality of whose performance is hard to verify reliably.
An effective way of overcoming such problems has been developed by Amazon, in the form of a model of working called the Amazon Mechanical Turk. This is used for tasks that require human intelligence, but not necessarily expertise in a particular field. In this model, tasks are assigned to a suitably large group of workers (called – in accordance with “the best principles of political correctness” – Mechanical Turks) in such small portions that each of them can perform their part at a time of day that suits them, with the ability to take a break from the task whenever necessary (for example, when getting off the bus or tram). Moreover, exactly the same portions of work are assigned to different workers who have no contact with each other, thus making it possible to verify the answers obtained and to prioritise those that are given consistently by multiple different assessors.
In the case of an e-mail classification task, the smallest portion of work to be done by the Mechanical Turk will be to assign a single e-mail message to one of two classes: SPAM or NON-SPAM. The collective results of the work of a large group of Mechanical Turks can then serve as a large training corpus to be used to prepare an AI system – in this case, one that is to perform the task of filtering out incoming spam e-mails.
A problem with AI, however, is that it can never be trusted with 100% certainty. In the case of a classification task, an AI system will only be able to determine that a given object probably belongs to a particular category. If the user requires the system to indicate a specific class to which an object is to be assigned (which is usually the case in practice), the system will choose the class for which the computed probability of the object’s belonging to that class takes the greatest value.
The Bayes classifier determines the probability of an object’s belonging to a particular class by making use of two factors: prior probability and posterior probability.
The first of these values is determined before starting to analyse the object. For example, the prior probability that a given e-mail is spam is determined purely from data on the percentage of e-mails in the training set that were placed in the SPAM category. If, for example, the training set consisted of 100 e-mails, and 30 of them were classified as spam, then the prior probability that a given e-mail is spam will be 30%.
Posterior probability is computed following analysis of the object. Its relevant characteristics are identified, and then it is established to which class those characteristics best correspond.
The list above contains the subject lines of e-mails which were classified as spam. Assume that these make up the training set, and on this basis the system has to classify a new e-mail with the following subject line:
Bargain! Subscription for $20 a month. Check it out!
One characteristic of this message is the use of exclamation marks. It turns out that this is a typical feature of messages classified as spam, as it occurs as many as five times in the ten elements of the training set that were placed in the SPAM category.
In practice, the Bayes classifier is used in such a way that for each of the characteristics of a given object, the system computes to what degree that characteristic is typical of a given class. In the case of text classification, for each word or punctuation mark in a given document, it is determined with what frequency it appears in all of the documents assigned to a particular category. The values obtained for all such characteristics are then multiplied together to obtain the value of the posterior probability.
The system places the object in the class for which the product of the two probabilities – the prior probability and the posterior probability – takes the greatest value.
Simple? The method is certainly not exceptionally complicated, but at the same time it is extremely effective. In fact, the effectiveness of the Bayes classifier is often taken as a benchmark for other more sophisticated solutions. It turns out to be not at all easy to beat!
Artificial intelligence (AI) is a type of computer system whose task is to imitate actions performed by a human.
How can we evaluate whether an AI system performs its task effectively? I will try to answer that question in this blog.
For a long time it was believed that chess, the king among board games, was so complex that no digital machine would ever be able to defeat a human grandmaster. The first time a computer (specifically, the program Deep Blue) beat the world champion (Garry Kasparov) was in 1996 – although it should be recalled that Kasparov came back to win several succeeding games. Since that time computing power has grown so much that today, even the greatest human genius would not stand a chance if pitted against an “expert” machine.
In 2016 another bastion of human domination fell to the machines. The program AlphaGo defeated champion go player Lee Seedol 4 games to 1. One of the last remaining domains of human intellect in the games world is bridge. While we are putting up a spirited defence against the barbarian computers, here too it is clear that our days are numbered.
We may thus propose the following criterion for positive assessment of the effectiveness of artificial intelligence in the field of games: AI plays a game effectively if it can defeat the human world champion.
Artificial intelligence methods perform extremely well at tasks that are defined by rigid rules. This is the reason why they cope so well with games, where the rules are precisely laid down. In tasks involving the perception of elements of reality – where such rigid rules no longer apply – the competences of the “Matrix” remain, for the moment, behind those of human beings.
Even a three-year-old child will easily recognise that the above photo shows a dog and not a cat. Unfortunately, even an advanced system of artificial intelligence will not be able to answer the question with absolute certainty. (The system presented identifies the image as a dog with 97% certainty, but in the end... who can be sure?)
In the case of perception of the outside world, we do not expect more from AI than we can do ourselves. Instead, comparing its results with those obtained by human perception, we are satisfied if the machine at least comes close to human competence. So what benefit do we derive by using AI, if its interpretation of reality is imperfect? The advantage is that it can perform such tasks incredibly fast. An AI system can look through millions of images a second, and automatically select – even with only 97% certainty – all those that contain a dog. A well-rested human might be able to perform the same task without ever making a mistake, but could process no more than a few images per second.
To evaluate the effectiveness of a computer system in tasks concerning the perception of outside reality, the results achieved by artificial intelligence are compared with those achieved by human perception. We consider that the system performs effectively if there is a high correspondence between what the computer records and what a human perceives. An example task is the identification of spam.
To evaluate the effectiveness of a system, we take a sample – 100 randomly selected e-mail messages, for instance – and then ask a human to classify each of them as either SPAM or NON-SPAM.
The sample with all of the objects classified by a human being is called a gold standard.
Accuracy is a measure that tells us what proportion of the objects in the gold standard sample were classified by the system in accordance with the human assessor’s decision. For example, if the system classifies all of the e-mails in the same way as the human did, its accuracy is 100%. A classifier that achieves an accuracy above 90% can be considered effective.
However, a high accuracy value does not always mean that the system meets our expectations. Let us modify the task to be given to the artificial intelligence system, now defining it as follows: “Identify all e-mails that relate to tourism.”
Assume that three out of the 100 e-mails in the gold standard sample were classified by a human as relating to tourism (in accordance with a typical distribution of the subjects of e-mail correspondence received by a random user), with the remaining 97 being on other subjects. Now imagine that after months of hard work we have developed an AI system that will identify tourism-related e-mails with an accuracy of 95%. Doesn’t sound bad?
But suppose someone else came up with an entirely different idea: they developed a system that achieves higher accuracy by simply classifying all e-mails as not related to tourism. That system, based on the gold standard, will attain an accuracy of 97%!
What conclusion can we draw from this? Accuracy is not a useful measure in a situation where one of the classes is significantly larger than the others. In that case, we need to look for other measures of effectiveness.
Precision is a measure that indicates what proportion of the objects assigned by the system to a class that interests us have been assigned correctly. For example, if our AI system identifies four tourism-related e-mails among the 100 objects in the gold standard sample, and they include the three that were classified as tourism-related by the human assessor, then the precision will amount to 75%.
Of course, this is not a bad result. But again, someone else may produce a system that identifies only one e-mail as relating to tourism, but does so correctly. The precision achieved by that solution will be 100%, even though it overlooked two out of the three relevant objects!
Yet another criterion for evaluation is therefore needed.
Recall is a measure that tells us what proportion of the objects belonging to the class that interests us were correctly identified by the system. In terms of this measure, our solution is unbeatable – it achieves a recall value of 100%, while the aforementioned system that identified just one e-mail about tourism manages only 33%!
Our satisfaction will be short-lived, however, because again someone will quickly develop a solution that identifies as many as 10 tourism-related e-mails, again including the three that were classified as such by a human. This program will record lower precision (just 30%), but in terms of recall it will attain a score of 100%, equal to ours.
So what can we do about this?
The conclusion is that an objective measure serving to evaluate the effectiveness of an artificial intelligence system ought to take account of both precision and recall. This condition is satisfied by a value called the F-measure, which is the harmonic mean of the two aforementioned measures, or twice their product divided by their sum.
Let us compute the F-measure for our solution. First compute the product 2 x 75% x 100%, which gives 150%. Then divide this result by the sum of the measures, namely by 75% + 100% = 175%. This gives an F-measure of 85.7%.
The F-measure for the solution that identified only one object is calculated as follows:
2 x 100% x 33.3% / 133.3% = 50%
while for the solution that identified 10 objects, the F-measure is
2 x 30% x 100% / 130% = 46.2%.
There is some justice in this world after all!
You may ask: why complicate life by defining this last measure of quality as the harmonic mean of the two values, and not as the universally understood arithmetic mean?
The idea is to reward those solutions for which the two measures “co-exist in harmony”. For example, for an object classifier that achieves precision and recall both equal to 50%, both the arithmetic and the harmonic mean will be 50%. However, if we “turn the screw too tight” and construct a solution with a high precision of 90%, at the cost of a low recall of 10%, the arithmetic mean of the two measures will again be 50%, while the harmonic mean will fall markedly – precisely, to a value of 18%.
In working to ensure high precision, we should not forget about maintaining high recall – and vice versa!
Morphological analysis is used to determine how a word is built up. The result of such analysis might be the statement that, for example, the word classes is made up of the stem class and the ending -es, which is used in English to make the plural form of certain nouns and the third person singular form of certain verbs. From this information we may deduce that classes is likely the plural of a noun (or the third person singular of a verb) whose base form, or lemma1, is class.
This type of inference is very useful in such tasks as context search. By decoding the probable context of a submitted query (establishing, for example, who the user is, where they are and what kind of information they are looking for) a document will be identified that most closely corresponds to their needs – even if that document does not contain the exact word forms that were used in the query.
For example, in response to the query “cheap laptops”, the Google search engine might display a snippet2 with the text: “We offer the most popular laptop models … cheaper than anywhere else …”. Note that this snippet does not strictly contain either of the words used in the query. Rather than the word cheap it contains the comparative form cheaper, and rather than laptops it contains the base form laptop.
To be able to inflect words correctly (as in the above example), a context search system needs to be supported by a dictionary that contains all word forms in a given language. In some applications, such a vast resource would just be redundant weight. For the purposes of classifying texts by subject matter, for example, it is enough to use a simplified version of morphological analysis called stemming. This involves cutting off the ending of a word (and sometimes also the beginning) to identify the stem. For example, the result of stemming for the words beauty, beautiful and beautician might be the same in each case: beaut. Stemming is a much faster process than dictionary-based analysis, and is quite sufficient for purposes of thematic classification.
Lemmatisation is a process whereby, for each word appearing in a document, the system identifies all of the lemmas (base forms) from which that word might potentially derive. For example, for the word bases, the lemmatiser (a program performing lemmatisation) will return the information that the word may come either from the noun or the verb base, or from the noun basis. Similarly, for the word number the program will say that it is either the base form of that noun (or verb), or else the comparative form of the adjective numb.
A slightly different type of analytical process is POS-tagging (Part of Speech Tagging). Here, based on context analysis, the POS-tagger (the program performing the task of tagging) is expected to determine unambiguously what part of speech is represented by a particular word used in a sentence. For example, in the sentence She bases her theory on a number of findings, the POS-tagger should identify the word bases as a verb, and the word number as a noun.
Lemmatisers, assuming they are equipped with suitably large dictionaries, do not make errors. Modern POS-taggers, on the other hand, operate with a success rate of around 95–98%, which means that occasionally they will return an incorrect result.
Today’s NLP solutions offer a combination of the two above-mentioned tools. Thus, when analysing a document, they can determine for each word both what part of speech it represents, and what its base form is.
If a computer system does not have a dictionary to check the existence of words, the way in which it analyses the structure of words (to perform stemming, for example) is fairly primitive. Most methods so far developed for this scenario involve cutting off successive letters from the end of a word for as long as suitably defined conditions are satisfied.
One of the most popular methods of stemming is the Porter algorithm, as illustrated in the above screenshot. As we can see, for the word isolation the algorithm identifies the stem isol by cutting off the ending -ation. However, for ration, the result of stemming is the whole of that word. This is because the potential stem remaining after removing the ending -ation in this case, namely the single letter r, is too short – according to the assumptions of the algorithm – for such reduction to be performed. The situation is similar with the words conditional and zonal. The ending -ness is removed from the word madness, and in hopefulness the algorithm removes two endings: the noun suffix -ness and the adjectival suffix -ful. However, in aness (a word invented by the user) the algorithm does not remove the ending, for exactly the same reason as in the case of ration.
The algorithm is based on manually constructed rules. Each rule has two parts: the first part says what conditions must be fulfilled by a potential stem in order for the removal operation defined in the second part to be applied.
An example rule used by the Porter algorithm is the following:
(m>0) ational -> ate
The first part of this rule (m>0) imposes a condition on the size of the stem left after removal of the ending – this size (which is roughly equivalent to the number of syllables) must be greater than zero. The second part of the rule (ational -> ate) indicates how the ending is to be replaced.
According to this rule, the word relational will be replaced by relate, but the word rational will be left unchanged, since the stem r does not satisfy the defined condition on its size.
One of the obstacles to effective automated morphological analysis using a dictionary is human lexical creativity. In any language, new words are being invented practically every day. For Polish, for example, a continuously updated ranking of neologisms can be found at https://polszczyzna.pl/neologizmy-ranking/.
At the time this post was written, the list was headed by the following words:
Moreover, possibilities of creating new words by adding endings or prefixes – whether to native words or those borrowed from other languages (particularly from English, in the case of a language like Polish) – are practically unlimited. Consider such creations as megaweekend, macromaker or instagramming.
Machine algorithms should not be left behind in this regard. For a computer system to have the ability to perform dictionary-based analysis of newly formed words, those words ought to appear in its dictionary.
For this purpose, a system may itself generate potential neologisms, for example by combining common words with popular prefixes. This means that the system’s dictionary may contain words that are not to be found in traditional lexicographical sources – words like supermachine and eco-promotion.
A computer’s lexical creativity must nonetheless be kept under control. Above all, the process of combining lexical units will inevitably lead to the generation of words that are not used in the language. This problem can be effectively eliminated by subjecting each automatically generated word to appropriate verification, and adding to the system’s dictionary only those words that occur with suitably high frequency in a selected corpus of texts (for instance, in the content indexed by the Google search engine).
Worse, though, is that words automatically generated by the system may turn out to be false derivatives – words with an entirely different meaning than would be suggested by their component elements. For example:
re+treat ≠ retreat
anti+gone ≠ Antigone
co+lander ≠ colander
ex+tractor ≠ extractor
e+vocation ≠ evocation
So to answer the question that forms the title of this section: Yes, a computer can be creative. Sometimes too much so!
Morphological analysis – the analysis of the structure of words – is an essential element of many systems for natural language processing. In some applications it is necessary to use a comprehensive dictionary of the language in question. In today’s solutions, the trend is to combine lemmatisation and tagging in a single process.
Sometimes, however, it is sufficient to apply methods where no dictionary is used – limited to determining the stems from which individual words are derived.
Human creativity means that new words are constantly being created. A computer can also be taught to be creative in word formation, although the results may be quite unexpected, and even comical.
1 The lemma, or base form, of a word is the form that appears in dictionaries – for example, the singular form of a noun or the bare infinitive of a verb.
2 A snippet is an extract from the text of a Web page which is returned by a search engine in response to a query.
How does a child normally learn to read in its native language? First it gets to know all the graphical symbols (letters, punctuation marks, accent marks, and so on) that are used to write that language. Next it learns the relationships between symbols and sounds, and learns to connect the sounds into words and phrases while reading. After some time the child is able to interpret even whole sentences at a single glance.
How can we teach a computer to read text? Can it learn like a human does, or does it require a different method? This is the question we will be considering in this post.
Assume that our computer “pupil” already knows the alphabet of a certain language. How can we give it the ability to identify words? This task might seem trivial – you just need to take together all the letters that stand adjacent to each other between spaces! However, as we will soon see, this simple recipe doesn’t always work.
Consider the following text:
No one in Las Vegas spends less than $10 000.
The logical way of dividing up this sentence into units should be this:
No one | in | Las Vegas | spends | less | than | $ | 10 000 | .
This is because the units “No one”, “Las Vegas” and “10 000”, although they contain spaces, constitute indivisible semantic items.However, if we divide the sentence according to the “space to space” rule, we get something quite different:No | one | in | Las | Vegas | spends | less | than | $10 | 000.
The above logical division of the text into units does not entirely correspond to the traditional division into words. In computer processing, in fact, the concept of a token is used. This refers to a sequence of consecutive characters, not necessarily alphabetical. In particular, a token may be a word, a number, a date, a telephone number, or a Web address.
The human activity of dividing a text into words corresponds to the computer term tokenisation, which means dividing a text into tokens.
Full stops (periods) cause significant difficulties when attempting to divide text into tokens. Consider this sentence:
Write to Dr. Kowalski at kowalski@poleng.pl.
The desired division of this text into tokens looks like this:
Write | to | Dr. | Kowalski | at | kowalski@poleng.pl | .However, in order to “tokenise” the sentence in this way, the computer program must be able to distinguish the periods that appear in abbreviations or e-mail addresses from those that mark the end of a sentence. Similar kinds of problems are caused by certain other signs, including hyphens, dashes, apostrophes, and emoticons.How do computers cope with dividing text?
There are two basic approaches to the task of tokenisation. In the case of dictionary- and rule-based methods, lexicons are created to list specific words, such as abbreviations ending with a period. In addition, patterns are defined for selected tokens (such as dates, Web addresses, etc.) – usually in the form of regular expressions – to enable such items to be identified in a text.
In methods that utilise machine learning, training data is supplied in the form of examples of correctly tokenised texts. The system is expected to learn to divide new texts solely on the basis of that data.
There are some languages that do not use separators between words. One of these is the official language of the People’s Republic of China, Mandarin Chinese.
To divide a text in this language into tokens, one may apply what is known as the MaxMatch algorithm.
The MaxMatch algorithm assumes the existence of a lexicon containing all of the words of a given language. Starting from the first character of the text we want to analyse, we identify the longest sequence of characters that appears in that lexicon, and then split it off from the rest of the text. (If no word from the lexicon was found, we split off the first character of the text.) This procedure is repeated for the remainder of the text.Let’s see how this algorithm would work for an example English sentence, if words in that language were not separated by spaces. Consider the sentence:Addtheyeasttothedough.
We would like to see it divided thus:
Add | the | yeast | to | the | dough | .
No luck, however – the MaxMatch algorithm will return the following division:
Add | they | east | tot | he | dough | .
This shows that the spaces we use to separate words in our language are a real godsend!
It would appear that a sentence is the basic fragment of text having a consistent interpretation. We read sentence by sentence; we speak sentence by sentence; we translate texts sentence by sentence. A computer tries to imitate us in this regard by attempting to divide a text into sentences. But an apparently simple rule, like “Divide after a full stop and before a space and a capital letter”, can sometimes let us down:
Undivided text:
In 2019 Mr. Brown visited many different countries, e.g. Germany, France and Canada. He simply loves to travel!
Divided text:
In 2019 Mr. | Brown visited many different countries, e.g. | Germany, France and Canada. | He simply loves to travel!
As you will notice, there are rather too many sentence divisions here.
Like in the case of dividing a sentence into tokens, there are two approaches to dividing a text into sentences:
When neural networks began to play an important role in natural language processing (the “neural breakthrough” in machine translation came in 2014), it turned out that the traditional way of dividing text into tokens was unsuitable. The input layer to neural networks has to consist of a vector of dimension equal to the number of all words used in the document being processed – in extreme cases this can be in the millions. Unfortunately, computations in such enormous networks are still beyond the reach of the machines being built today.
A way therefore had to be found to compress the dictionary – namely to reduce its volume as counted in words. One such method is the BPE algorithm. The input data to this algorithm consists of the text (or set of texts) that is to be processed, along with the target size of the dictionary, which in practical scenarios should not be much greater than 50,000. The items appearing in such a dictionary are known as subwords.
The BPE algorithm is based on the merging of the most frequent bigrams, where a bigram is a pair of neighbouring fragments of text.
Let us analyse the action of the BPE algorithm when it receives as input a target dictionary size of 14 and the following text:
Hurry on, Harry, hurry on!
The initial vocabulary consists of single characters:
{<w>, h, u, r, y, o, n, a}
The size of vocabulary is 8.
In the first step, punctuation marks are removed from the text, and the text is divided into single letters. Word boundaries (denoted here as <w>) are treated the same as a letter:
<w> h u r r y <w> o n <w> h a r r y <w> h u r r y <w> o n <w>
The numbers of occurrences of individual bigrams in the given text are as follows:
<w> h | 3 |
h u | 2 |
u r | 2 |
r r | 3 |
r y | 3 |
y <w> | 3 |
<w> 0 | 2 |
o n | 2 |
n <w> | 2 |
h a | 1 |
a r | 1 |
We begin by merging the first bigram that has three occurrences: <w> h.
<w>h u r r y <w> o n <w>h a r r y <w>h u r r y <w> o n <w>
The current vocabulary now consists of 9 subwords:
{<w>, h, u, r, y, o, n, a, <v>h}
This has eliminated the bigrams h u and h a, since in all cases the letter h has been joined to the <w> sign. Similarly, single instances of the y <w> and n <w> bigrams have been lost.
The next bigram with three occurrences is r r. We now merge this into a single unit:
<w>h u rr y <w> o n <w>h a rr y <w>h u rr y <w> o n <w>
The current vocabulary now consists of 10 subwords:
{<w>, h, u, r, y, o, n, a, <v>h, rr}
This merging leads to the disappearance of the bigrams u r, r y and a r, but we can still merge rr and y.
The current vocabulary now consists of 11 subwords:
{<w>, h, u, r, y, o, n, a, <v>h, rr, rry}
There are no more bigrams with three occurrences, so we go on to those with two. The first of these (in order of appearance) is <w>h u.
<w>hu rry <w> o n <w>h a rry <w>hu rry <w> o n <w>
The current vocabulary now consists of 12 subwords:
{<w>, h, u, r, y, o, n, a, <v>h, rr, rry, <w>hu}
We next merge further bigrams having two occurrences:
<w>hurry<w> o n <w>h a rry <w>hurry <w> o n <w>
<w>hurry<w> o n <w>h a rry <w>hurry<w> o n <w>
In the last of these representations, the dictionary has been increased to 14 subwords, which was the number required:
{<w>, h, u, r, y, o, n, a, <v>h, rr, rry, <w>hu, <w>hurry, <w>hurry<w>}
It is therefore time to finish… both the running of the algorithm, and this blog post.
Humans would like AI to process a natural language in the same way that they do. The first systems for natural language analysis were thus built on rules for building sentences similar to those we remember from grammar lessons. Machine translation systems took their translation rules directly from foreign language textbooks.
As it turned out, however, the real breakthrough in computer natural language processing came only when we allowed machines to come into their own – to learn independently on the basis of data that we supplied to them. Until recently, it seemed that dividing text into words, sentences and paragraphs would remain the domain of rule-based systems, operating based on analogies with actions performed by humans. Today, it seems that in this task, as in many others, machines will find their own way.
The idea of using numbers to represent words, or texts made up of words, is “as old as time”. Texts are converted into number sequences in a process of encryption; then in the process of decryption the reverse operation is performed, in a manner known only to the intended recipient. In this way, the encrypted message cannot fall into the wrong hands. Encrypting tools are reported to have been used even in ancient Greece: “A narrow strip of parchment or leather was wound onto a cane, and text was written along it on the touching edges. The addressee, having a cane of the same thickness, could quickly read the text of the message. When unfurled, to show meaningless scattered letters, it would be of no use to a third party; it was understandable only to the intended recipient, who would match it to his template” (https://en.wikipedia.org/wiki/Scytale).
As early as 1949 the American mathematician Warren Weaver formulated the thesis that methods of encryption and decryption might be applied automatically – with the use of calculating machines – to translate text from one language to another. However, his vision would come to pass only 65 years later, when the first articles appeared on the practical use of neural networks in machine translation. Why not earlier? No doubt because nobody was then able to represent text numerically in such a way that the operations performed by neural networks would not cause the meaning to be lost.
It would seem at first sight that words could be represented in the form of successive natural numbers – in alphabetical order, for example – at least within the scope of the document being processed. Let us assume, then, that we wish to process with a computer (for purposes of translation or knowledge extraction, say) a document consisting of the following four sentences:
Austria’s capital is Vienna.
Australia’s capital is Canberra.
Austria’s currency is euros.
Australia’s currency is dollars.
To enable the methods of machine learning to be applied, this document must be converted to numerical form. So let us arrange all the words appearing in the text in alphabetical order (ignoring the issue of upper and lower case letters), and then assign successive natural numbers to the words in this sequence:
australia’s: 1
austria’s: 2
canberra: 3
capital: 4
currency: 5
dollars: 6
euros: 7
is: 8
vienna: 9
Let us suppose that each of the above four sentences is to be presented as a vector of the words appearing in it, with each word represented by its corresponding number. The first two sentences will then appear as follows:
Austria’s capital is Vienna → {2, 4, 8, 9}
Australia’s capital is Canberra → {1, 4, 8, 3}
We would now like to define a numeric representation for both of the above sentences in combination. The natural operation would appear to be addition of the two vectors:
{2, 4, 8, 9} +
{1, 4, 8, 3} =
{3, 8, 16, 12}
The resulting vector contains two values that do not index a word appearing in the document (16 and12), as well as values corresponding to the words “canberra” and “is”. It turns out that the occurrences of “austria’s” and “australia’s” summed to “canberra”, and the two occurrences of the word “capital” summed to give the word “is”!
Neural networks can of course perform much more complex operations on numbers (apart from adding, they can also multiply and apply functions); however, if using the above representation, calculations of this type would be founded on the unjustified assumption that, for example, the word “canberra” has a value which is the sum of the values of the words “austria’s” and ‘australia’s”, which makes no sense semantically. For this reason, it is necessary to look for a completely different way of representing words in numerical form.
Let us assume that all words have the same value of exactly 1. We again order the set of words contained in a document in a certain defined manner (alphabetically, say), but this time representing each word with a vector, having the value 1 in the appropriate (“hot”) position, and zero everywhere else.
For example, the word “australia’s” (first in the alphabetical sequence) has a 1 in the first position:
{1, 0, 0, 0, 0, 0, 0, 0, 0}
while the word “capital” (fourth in the sequence) has a 1 in the fourth position:
{0, 0, 0, 1, 0, 0, 0, 0, 0}
The representation of a whole sentence is now obtained by determining the logical sum of the vectors corresponding to all of the words in the sentence (where the logical sum of two 1’s is 1). For example, the representation of the sentence:
Austria’s capital is Vienna
Is determined as follows:
{0, 1, 0, 0, 0, 0, 0, 0, 0} +
{0, 0, 0, 1, 0, 0, 0, 0, 0} +
{0, 0, 0, 0, 0, 0, 0, 1, 0} +
{0, 0, 0, 0, 0, 0, 0, 0, 1} =
{0, 1, 0, 1, 0, 0, 0, 1, 1}
We can use a similar procedure to combine the information contained in sentences. For example, the combined representation of the first and second sentences (“Austria’s capital is Vienna” and “Australia’s capital is Canberra”) takes the form:
{1, 1, 1, 1, 0, 0, 0, 1, 1}
This vector has 1’s in the positions corresponding to the words that appear in either the first or the second sentence.
The representation of the whole document (all four sentences given above) has the form:
{1, 1, 1, 1, 1, 1, 1, 1, 1}
This method of representing words numerically is known as one-hot encoding.
An obvious drawback of one-hot encoding is the fact that it takes no account of how many times particular words appear. Even if the word “capital” appears 1000 times in a given document, it will be given the same weight as the word “kale”, as long as the latter appears at least once in the document (even by mistake). A system attempting to identify the topic of that document might mistakenly conclude that the text concerns healthy eating, rather than political administration.
A commonly used coding method, therefore, is based on number of occurrences: the vector contains the numbers of instances of particular words. In this case, our four-sentence document might be coded as:
{1, 1, 1, 1, 0, 0, 0, 1, 1}
For example, the word “is”, which corresponds to the eighth element of the vector, appears in the document four times.
In sets consisting of multiple documents, counting occurrences of words will favour those that appear in longer documents (there will naturally be more occurrences of words in those documents). To give “equal chances” to words that come from shorter documents, a representation based on term frequency is used. Here, the number of instances of a word is divided by the total number of words in the document in question. Accordingly, our example document is represented by the following vector, in which each number representing the count of instances of a particular word is divided by 16 (the total number of words in the document):
{1/8, 1/8, 1/16, 1/8, 1/8, 1/16, 1/16, 1/4, 1/16}
The expression term frequency can be abbreviated to TF.
Consider a computer system designed to identify the topic of a given text. It would seem appropriate to take account, first of all, of those words that occur in the text with the highest frequency. But is that the right approach? In our example document the most frequently occurring word is found to be “is”, which unfortunately gives us little idea what the topic might be.
This problem can be overcome using TF-IDF representation. This takes account of the frequency of particular words in the document (TF), but reduces the value of common, “uninformative” words (“is”, “the”, “in”, etc.) by applying a second factor, called the inverse document frequency, or IDF.
Let us assume that apart from our example document, the system also has three other documents to analyse (we are thus dealing with a set of four documents in all). Let us also assume that each of the other three documents contains the word “is”, but none of them contains any of the other words used in our example document. A word’s IDF is calculated as the total number of documents in the set divided by the number of documents containing that word. For the word “is” the IDF is 1 (4/4), while for the other words it is 4 (4/1).
In the TF-IDF representation, the two factors (TF and IDF) are multiplied (in practice the second one is usually transformed using a logarithmic function, but we overlook this detail for the moment). The TF-IDF representation of our example document will therefore be:
{1/2, 1/2, 1/4, 1/2, 1/2, 1/4, 1/4, 1/4, 1/4}
This is more like what we expect! The words with the highest value are now “Austria’s”, “Australia’s”, “capital” and “currency”, which accurately reflect the topic of the text under analysis.
The one-hot representation and frequency-based representation provide information on the relationship between words and documents (they indicate whether a word appears in a document, how frequently it appears, and how that frequency compares with its frequency in other documents). However, they provide no information as to the meanings of the analysed words.
Does there exist any way of representing numerically the meaning of words? We are helped here by the distributional hypothesis, which says that words appearing in similar contexts within large sets of text data are likely to have similar meanings.
Our example document largely confirms that hypothesis: pairs of words appearing there in similar contexts include <dollars, euros>, <canberra, vienna> and <capital, currency>. The first two of these pairs can readily be seen to consist of words of similar meaning.
It is quite likely that the words “canberra” and “vienna” will also appear in similar contexts in other documents. However, the probability that the words from the last pair, “capital” and “currency”, will appear in similar contexts would seem to be much lower. It is for this reason that the statement of the distributional hypothesis refers to “large sets of text data”.
Let us now try, for the words “austria’s” and “australia’s”, which have similar meaning, to build their vector representation using information about the words that occur together with them in the same sentences within our example document.
Recall that our example document looks like this:
Austria’s capital is Vienna.
Australia’s capital is Canberra.
Austria’s currency is euros.
Australia’s currency is dollars.
The words appearing in the above document, arranged in alphabetical order, are as follows:
australia’s: 1
austria’s:: 2
canberra: 3
capital: 4
currency: 5
dollars: 6
euros: 7
is: 8
vienna: 9
Based on the first sentence of the document, we will now build a vector for the word “austria’s”, containing information about the number of instances of words that co-occur with it in the analysed sentence (a 1 appears in the positions corresponding to the words “capital”, “is” and “vienna”, and the remaining places have 0):
{0, 0, 0, 1, 0, 0, 0, 1, 1}
We can build a similar vector for the word “australia’s” based on the second sentence in the document:
{0, 0, 1, 1, 0, 0, 0, 1, 0}
These vectors are quite similar to each other: they differ in only two places out of nine.
In a similar way, we can build vectors for the same two words based on the third and fourth sentences
{0, 0, 0, 0, 1, 0, 1, 1, 0} (vector of words co-occurring with “austria’s”)
{0, 0, 0, 0, 1, 1, 0, 1, 0} (vector of words co-occurring with “australia’s”)
We can now sum the two vectors for the word “austria’s”:
{0, 0, 0, 1, 0, 0, 0, 1, 1} +
{0, 0, 0, 0, 1, 0, 1, 1, 0} =
{0, 0, 0, 1, 1, 0, 1, 2, 1}
Similarly, we can sum the information from the two vectors for the word “australia’s”:
{0, 0, 1, 1, 0, 0, 0, 1, 0} +
{0, 0, 0, 0, 1, 1, 0, 1, 0} =
{0, 0, 1, 1, 1, 1, 0, 2, 0}
Are the vectors computed for the words “austria’s” and “australia’s” over the whole document similar to each other? We can try to determine this similarity by computation.
Is there a simple formula for calculating the similarity of vectors representing different words? There is. It is based on cosine similarity, a value that always lies in the range from 0 to 1. For identical vectors, the similarity is 1; for vectors that are as different from each other as possible, the similarity is 0.
This is the formula:
We first calculate the scalar product of the vectors representing the words “austria’s” and “australia’s” respectively (the numerator in the formula):
A * B = (0 + 0 + 0 + 1 + 1 + 0 + 0 + 4 + 0) = 6
Then we calculate the Euclidean lengths of the vectors (which appear in the denominator):
||A|| = sqrt(1 + 1 + 1 + 4 + 1) = sqrt(8)
||B|| = sqrt(1 + 1 + 1 + 1 + 4) = sqrt(8)
similarity = 6 / (sqrt(8) * sqrt(8)) = 3/4
It turns out that the measure of the similarity between the two vectors is indeed quite high.
The problem with the representation based on the distributional hypothesis is the dimension of the vectors involved. This is equal to the number of different words appearing in the whole set of documents, which in extreme cases may reach into the millions. Vectors with such large numbers of elements are not suited to neural network calculation, for example.
Word2Vec is a representation for words that is based on the distributional hypothesis, but has the additional property that the vector dimension will be no larger than a few hundred. Moreover, the dimension does not depend on the size of the dictionary, but may be specified arbitrarily by the system architect. The Word2Vec representation thus reflects the relation of similarity between words, while at the same time enabling the completion of NLP tasks (machine translation, sentiment analysis, automatic text generation, etc.) with the use of neural networks.
The output below contains lists of words whose vectors, computed on the basis of Wikipedia texts, are the most similar to the vectors for the words given at the top of the columns: “nagoya” for the left-hand column, and “coffee” for the right-hand column. This means that these are the words for which the measure of similarity attains its highest value.
There exist many ways of representing words with numbers. Currently the most popular are numerical vectors constructed by the Word2Vec method. The dimension of such a vector is defined by the system architect; there are most commonly between 300 and 500 coordinates. The vectors reflect the meanings of words in such a way that words of similar meaning correspond to vectors for which the similarity measure takes a high value.
One of the main advantages of storing documents in digital form is the ease of searching them for particular words and phrases. If you had the paper version of the book A Game of Thrones, and you wanted to find the first time the character name “Daenerys” appeared, it would be like looking for the proverbial needle in a haystack. But in the digital version? Simply use the Find function. Replacing text is just as easy: using a global Replace command, you could, for example, change all instances of “Daenerys” to the alternative spelling “Denerys”.
However, the Find and Replace functions do not enable you to seek out simple patterns that a human would easily spot with the naked eye. A practised reader need only glance at a page of text to identify all instances of dates, postal addresses or e-mail addresses. But how to do this using the Find function? Unfortunately, it is by no means easy!
The task of seeking or replacing various text fragments that have a regular structure can be carried out using a mechanism known as regular expressions.
Put simply, a regular expression is a pattern representing the set of all text strings that match the pattern. For a given regular expression, a suitably constructed computer mechanism is able to search a text and locate all strings in it that match the expression.
In practice, the program seeking strings that match a given regular expression often returns not quite the results that were expected. How is this possible? No doubt the creator of the regular expression made a mistake – and the most common reason for errors is misunderstanding of the priorities of operations.
The order of priority of operations in regular expressions is as follows:
The regular expression “(ha)*” is matched by ha, haha, hahaha, etc., because we first group the characters h and a by means of parentheses, and afterwards apply repetition by means of the asterisk. The expression “ha*”, however, is matched by h, ha, haa, haaa, etc., because the operation of repetition is first applied to the character a, and afterwards the result is combined with the remainder of the character sequence.
The regular expression “Kowalski|a” is matched by Kowalski and by a (but not by Kowalska), since concatenation of characters has a higher priority than the specification of an alternative.
Different regular expressions may represent the same set of strings. For example, the expression “cat|dog” is equivalent to “dog|cat”, and “(cat)*” is equivalent to “((cat)*)*”. This property of regular expressions would appear to be highly advantageous – it means that a search for a particular text pattern can be expressed in different ways (just as the correct answer to a mathematical exercise can be reached in a variety of ways). The problem is that the creator of a regular expression may believe that two forms are equivalent when in fact they are not. For example, the expression “Kowalski|a” is not equivalent to “Kowalski|Kowalska”.
The operations listed above make up the basic (mathematical) definition of regular expressions. However, computer scientists have gone a bit further than mathematicians, introducing a whole series of additional operations. The most popular of these are character classes, the dot character, quantifiers, and anchors.
Character classes – written in square brackets – are another way of representing alternatives. For example, the expression “[AEO]la” is matched by Ala, Ela and Ola. The expression “[A-Z]la” is also matched by Bla, Cla, Dla, etc.
It is also possible to define negative classes. The expression “[^ds]” (with the caret symbol after the opening bracket) matches all characters except for d and s. Thus, the regular expression “[^ds]ay]”, given the text:
They say today is a day away.
will match only the string way, appearing as part of the word away.
Character classes can also be written using certain abbreviations; for example, “\d” denotes any digit, “\w” denotes any letter, and “\s” denotes any white character (space, tab or newline).
The dot (period, full stop) is a special character. When used in a regular expression it will be matched by any character at all. So the expression “c.t” will be matched by cat, cot and cut, but also by c5t, for instance.
If you want a regular expression to represent a character that would normally be a special character (such as a full stop), you must place the symbol \ before that character. For example, the regular expression “ai\.POLENG\.pl” may be used to find the e-mail address of an employee of our company.
Quantifiers are used to search for repetitions. One of them is the asterisk special character that was mentioned earlier.
The special character + stands for at least one repetition. For example, “hur+ah+” is matched by hurah, hurrah, hurahh, etc., but not by hua, hura or huah.
The special character ? denotes zero or one instance. The expression ‘Mr\.?’’ matches both “Mr.” with a period, and “Mr” without one.
Quantifiers can also specify the number of repetitions. The regular expression “\d{4}” matches exactly four digits, while “\w{5,10}” is matched by a string of five to ten letters.
Anchors specify the place in the text where the string is to occur.
The special character ^ indicates that the string must appear at the very start of the text.
The special character $ indicates that the matched string must appear at the end of the text.
The regular expression “^\d+$” thus matches only texts consisting entirely of digits.
The anchor \b denotes a word boundary. The expression “\bcat\b” will not be matched to any string in the text We have two cats, because there is no word boundary immediately following the string cat.
The possibilities of text searching using regular expressions can be expanded with the use of flags. For example, the flag i means that upper and lower case letters are to be treated as identical when matching strings to the pattern, and the flag g means that the expression handling mechanism is to find all strings matching the pattern, not only the first of them.
A flag is not part of a regular expression. The information that a flag is to be applied is supplied separately by the user of the mechanism (in a manner that depends on the particular solution being used).
([12]\d{3}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]))
The above regular expression matches the currently most common date range and format. It requires that the date begin with the digit 1 or 2 ([12]), followed by any three digits (\d{3}). The next character is a hyphen (-), and after that comes the month: either a two-digit string beginning with zero (but not 00), or one of 10, 11, 12 (0[1-9]|1[0-2]). After another hyphen (-) comes the day – this consists of two digits, of which either the first is zero and the second is from 1 to 9 (0[1-9]), or the first is 1 or 2> and the second may be any digit ([12]\d), or else the first is 3 and the second is 0 or 1 (3[01]).
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
The above expression – when used with the flag i – matches most e-mail addresses. It requires the address to begin with at least one letter, digit or other permitted character (b[A-Z0-9._%+-]+), after which comes the @ character, and then at least one string ending in a dot ([A-Z0-9.-]+\.), followed by a domain code consisting of at least two letters ([A-Z]{2,}).
There are many online tools you can use to check whether a regular expression you have constructed “works” exactly as it should. One of them is available at regex101.com.
Write your regular expression in the edit window, and in the text field insert the text that is to be searched. The tool will highlight the first occurrence in that text of a string that matches the regular expression being tested.
Notice that in the above example the search mechanism is applying the flag i (which was enabled by clicking at the right-hand side of the edit window). If that flag is disabled, the e-mail address will not be found:
One of the main reasons why more and more texts are stored digitally is the ease of searching for information. Although combing the haystack for a single needle (a specific string) is a relatively simple task, and is thus available in virtually any text editor, finding multiple needles (all strings of a specified type) is far less trivial.
Regular expressions are currently the most popular mechanism for searching texts for strings of a specified type. It is therefore well worth becoming familiar with them – or at least learning the basics, as you have done now you have read this post.
For a computer to be able to conduct a conversation like a human, it needs to master four elements of dialogue:
Speech recognition is a process in which the computer listens to a human utterance and converts it to written form. The process can be seen on certain news channels, for example, where the words being spoken by a journalist appear on the screen as subtitles almost immediately.
In recent years, systems of this type have made huge progress in recognising speech signals in practically any of the world’s natural languages. Not long ago it was close to a miracle if a car navigation system understood a spoken destination address correctly; today we are surprised if even the smallest error occurs in understanding that kind of spoken text.
Just a few years ago, a computer-generated voice would jar with its unvarying rhythm and unnatural intonation. Today, you need an especially sensitive ear to distinguish a synthetic voice from a human one.
Text generation is a process of creating a text in a natural language based on knowledge encoded in the computer. For example, based on information of the type: “<departure station: Kings Cross; operator: LNER; destination station: Newcastle; delay: 30>” the computer system can generate the text: “The LNER train from Kings Cross to Newcastle is delayed by thirty minutes”.
Text generation can be done using a human-designed set of rules or templates; it is a much easier task than language comprehension, which we discuss next.
We can say that a computer system understands a natural language if, in response to a human utterance, it takes action in accordance with the speaker’s intention. For example, in response to a question, the system should provide a relevant piece of information; in response to a command, it should perform the requested action; and in response to an expression of opinion, it should continue the dialogue in the way a human would.
Understanding natural language is the hardest problem in computerised conversation, and is still far from being solved. This is because it is very hard to develop a finite set of rules or principles that will take care of the infinite number of possible human utterances.
However, if we accept that the system is to operate exclusively within a certain simplified reality, in which the user’s utterances are limited in vocabulary and structure, it then becomes possible – and not even especially complicated – to write a computer program that will understand natural language.
In this blog post I will describe, step by step, how such a system can be created. The example given will be written in Python, but even without knowledge of that or any other programming language, the reader will be able to learn how to build effective systems of this type.
The system in this example will be a virtual trader, doing business in accordance with the mission statement: “I buy in order to sell.” It will be an uncomplicated system, responding to short sentences with simple syntax. To broaden its potential market, our trader will communicate in English, and its partner in conversation will give it commands of the type:
Buy 5 large blue metal boxes.
Sell 3 tiny red plastic rings.
The trader will do everything to make a suitable deal in every case. It will accept any purchase instruction, recording the relevant information in its inventory. It will carry out a sale instruction only if it has previously bought the required quantity of goods.
We will accept that the system understands what is said to it if its inventory is always in agreement with the purchases made and the sale transactions concluded.
Our program is expected to work in the following way:
The user gives a command in a natural language, e.g.:
Buy 5 large blue metal boxes.
We will have to create a dictionary containing all of the words that the system is expected to understand. These will be called tokens. First, however, we need to define the permissible meanings of words, called token types. The corresponding Python source code takes the following form:
tokens = ( 'OPERATE', 'NUMBER', 'SIZE', 'COLOR', 'MATERIAL' 'KIND', )
We have defined six token types, corresponding to different meanings of tokens. In a moment we will determine, for example, that the dictionary can contain a token “Buy” belonging to the OPERATE type.
Guess to which of the above types the tokens in this sentence belong:
“Buy 5 large blue metal boxes.”
Answers in a moment…
At the next step, we define for each type what tokens belong to it and what the system has to do when a specified token appears in the user’s command:
def t_OPERATE(t): r'Buy | Sell' return t
The above code assigns to the OPERATE type the two words Buy and Sell (second line), and instructs the system to remember the given operation when such a token appears in the user’s command (third line).
def t_NUMBER(t): r'\d+' t.value = int(t.value) return t
The above code assigns to the NUMBER type any text consisting purely of digits (second line), transforms that text into the corresponding numerical value (third line), and instructs the system to remember that value whenever such a token appears in the user’s command (fourth line).
def t_MATERIAL(t): r'metal | plastic' if t.value =='metal': t.value = 1 elif t.value == 'plastic': t.value = 2 return t
The above code assigns to the MATERIAL type the words metal and plastic. Next (and this is something new!) it instructs the system to store either 1 (if the material is metal) or 2 (if the material is plastic).
We can construct the other token types analogously to MATERIAL:
def t_COLOR(t): r'(black | white | red | green | blue)' if t.value =='black': t.value = 1 elif t.value == 'white': t.value = 2 elif t.value == 'red': t.value = 3 elif t.value == 'green': t.value = 4 elif t.value == 'blue': t.value = 5 return t def t_SIZE(t): r'tiny | small | big | large' if t.value =='tiny': t.value = 1 elif t.value =='small': t.value = 2 elif t.value =='big': t.value = 3 elif t.value =='large': t.value = 4 return t
def t_KIND(t): r'box(es)? | ring(s)?' if t.value[0] =='b': t.value = 1 else: t.value = 2 return t
In the definition of the KIND type we ensure that the dictionary contains the words box, boxes, ring and rings. When a token referring to boxes appears, the value 1 is stored; and when the token refers to rings, the value 2 is stored.
How does our computer trader identify the index of the goods mentioned in a command? This is made clear by the following table, showing the values defined for particular tokens.
Value | KIND | SIZE | MATERIAL | COLOR |
1 | box(es) | tiny | metal | black |
2 | ring(s) | small | plastic | white |
3 | big | red | ||
4 | large | green | ||
5 | blue |
When the system has to determine the index of an item being bought or sold, it does so in the way indicated by the above table – where the ordering of the columns is significant.
For example, to determine the index of the good referred to as:
large blue metal boxes
the system must first arrange the given attributes in an order corresponding to the columns of the table: KIND, SIZE, MATERIAL, COLOR. This gives in this instance:
boxes large metal blue
It then assigns digits to the tokens as given in the table:
1415
(boxes → 1; large → 4; metal → 1; blue → 5)
The index of the good large blue metal boxes is therefore 1415. Notice that even if the user had stated the attributes in a different order, the resulting index would still be the same.
The virtual trader will only understand commands given in accordance with the rules of grammar defined by the programmer. Every rule indicates how smaller parts of a sentence can be put together to make a larger whole. For example, the following rule says that a command must contain, in order, a token denoting an operation, a token denoting a number, and a component denoting the article:
command: OPERATE NUMBER article
A grammar rule can also contain instructions which the program is to perform when it finds that the analysed text conforms to that rule.
Let us begin with the simplest rules, concerning the attributes of goods:
def p_attribute_color(p): 'attribute : COLOR' p[0] = p[1]
The rule denoted p_attribute_color (the name of a rule must begin with “p_”, but the rest of the name is freely chosen by the programmer) says that a part of the sentence with the name attribute may consist exclusively of a token of COLOR type (second line). The third line says that in this situation the value of the attribute component (denoted p[0]) is to be identical to the value of the COLOR token (p[1] denotes the first, and in this case only, component).
Thus, if the attribute black appears in the command, it will have the value 1, the attribute white will have the value 2, and so on.
def p_attribute_material(p): 'attribute : MATERIAL' p[0] = 10 * p[1]
The rule with the name p_attribute_material says that a part of the sentence with the name attribute may alternatively consist exclusively of a token of MATERIAL type (second line). The third line says that in this situation the value of the attribute component is to be the value of the MATERIAL token multiplied by 10.
Thus the attribute metal has the value 10, and the attribute plastic has the value 20.
Analogously:
def p_attribute_size(p): 'attribute : SIZE' p[0] = 100 * p[1]
The description of an article may consist only of the kind of good:
def p_article_kind(p): 'article : KIND' p[0] = 1000 * p[1]
The value will be 1000 for boxes and 2000 for rings.
The description of an article may be preceded by an attribute of any type:
def p_article_attribute(p): 'article : attribute article' p[0] = p[1] + p[2]
Adding an attribute to the description of an article increases its value. For example, the description “boxes” has the value 1000, “metal boxes” has the value 1010, “blue metal boxes” has the value 1015, and “large blue metal boxes” has the value 1415.
The most important rule concerns the entire command:
# Main parser rule (command) def p_command(p): 'command : OPERATE NUMBER article' index = p[3] #Buy article if p[1] == 'Buy': tab[index] += p[2] print('OK. I am buying ' + str(p[2])+ ' new articles indexed as ' + str(index) +'.') print('No of articles in shop: '+ str(tab[index])) #Sell article elif p[1] == 'Sell': if p[2] > tab[index]: print('I do not have as many articles as you want.') else: tab[index] -= p[2] print('OK. I am selling ' + str(p[2])+ ' articles indexed as ' + str(index) + '.') print('No of articles in shop: ' + str(tab[index]))
The first line states that the analysed command must consist of, in order, the operation (an OPERATE token), the number of articles (a NUMBER token), and a description of those articles (article).
The index of the good referred to in the command is taken to be the value of the article (p[3]). In the case of a purchase, the inventory record for the good is increased by the value of the NUMBER token, and an appropriate message is returned. In the case of a sale, the system additionally checks whether the required quantity of the good is available.
What can I do for you?
Buy 10 large blue metal boxes
OK. I am buying 10 new articles indexed as 1415.
No of articles in shop: 10
What can I do for you?
Buy 5 tiny white plastic rings
OK. I am buying 5 new articles indexed as 2122.
No of articles in shop: 5
What can I do for you?
Sell 5 large blue metal boxes
OK. I am selling 5 articles indexed as 1415.
No of articles in shop: 5
What can I do for you?
Sell 3 tiny white plastic rings
OK. I am selling 3 articles indexed as 2122.
No of articles in shop: 2
What can I do for you?
Sell 3 tiny white plastic rings
I do not have as many articles as you want.
What can I do for you?
Bye
>>>
The full code of the program is given below (with essential “decorations”).
# Import lexer and parser from ply module import ply.lex as lex import ply.yacc as yacc # List of token types. tokens = ( 'NUMBER', 'OPERATE', 'SIZE', 'KIND', 'COLOR', 'MATERIAL' ) # Token types may be defined as regular expressions, e.g. r'Buy | Sell' def t_OPERATE(t): r'Buy | Sell' return t def t_MATERIAL(t): r'metal | plastic' if t.value =='metal': t.value = 1 elif t.value == 'plastic': t.value = 2 return t def t_COLOR(t): r'(black | white | red | green | blue)' if t.value =='black': t.value = 1 elif t.value == 'white': t.value = 2 elif t.value == 'red': t.value = 3 elif t.value == 'green': t.value = 4 elif t.value == 'blue': t.value = 5 return t def t_SIZE(t): r'tiny | small | big | large' if t.value =='tiny': t.value = 1 elif t.value =='small': t.value = 2 elif t.value =='big': t.value = 3 elif t.value =='large': t.value = 4 return t def t_KIND(t): r'box(es)? | ring(s)?' if t.value[0] =='b': t.value = 1 else: t.value = 2 return t def t_NUMBER(t): r'\d+' t.value = int(t.value) return t # Lexer error handling rule (Handle words out of vocabulary) def t_error(t): print("Illegal character '%s'" % t.value[0]) t.lexer.skip(1) # Ignore white spaces t_ignore = ' \t' # Main parser rule (command) def p_command(p): 'command : OPERATE NUMBER article' index = p[3] #Buy article if p[1] == 'Buy': tab[index] += p[2] print('OK. I am buying ' + str(p[2])+ ' new articles indexed as ' + str(index) +'.') print('No of articles in shop: '+ str(tab[index])) #Sell article elif p[1] == 'Sell': if p[2] > tab[index]: print('I do not have as many articles as you want.') else: tab[index] -= p[2] print('OK. I am selling ' + str(p[2])+ ' articles indexed as ' + str(index) + '.') print('No of articles in shop: ' + str(tab[index])) # Parser rule (attribute) def p_attribute_color(p): 'attribute : COLOR' p[0] = p[1] # Parser rule (attribute) def p_attribute_material(p): 'attribute : MATERIAL' p[0] = 10 * p[1] # Parser rule (attribute) def p_attribute_size(p): 'attribute : SIZE' p[0] = 100 * p[1] # Parser rule (article - stop) def p_article_kind(p): 'article : KIND' p[0] = 1000 * p[1] # Parser rule (article - recursion) def p_article_attribute(p): 'article : attribute article' p[0] = p[1] + p[2] # Syntax error handling rule def p_error(p): print("Syntax error in input!") ####################################### #Main program #Initialize table of articles (zero articles at the beginning) tab = [] for index in range(3000): tab.append(0) #Build the lexer lexer = lex.lex() #Tokenize (short version) # for tok in lexer: # print(tok) #Build the parser parser = yacc.yacc() #Main loop while True: s = input('What can I do for you? \n') if s == 'Bye': break parser.parse(s)
It is commonly believed that the processing of text documents requires knowledge of a high-level programming language. A decade or so ago it was considered proper to have good knowledge of the Perl language, while today a specialist in the field “absolutely must” have mastery of Python. But is this knowledge really indispensable?
In this post I will show that even sophisticated tasks in text processing can be completed quickly and easily without knowledge of a single command in any programming language.
Let’s begin with the simplest task: checking how many words are contained in a given text document. This can be done using the following sequence of instructions:
If the result is to be stored (saved in some file), the sequence of instructions may take the form:
There are instructions like this available in the Linux operating system. To obtain the desired result, you need to enter the appropriate command names, which have to be learnt. However, I can recommend something much more attractive: a graphical program in which, instead of entering hard-to-memorise commands, you can simply move pieces of a jigsaw.
Go to the page https://s416072.students.wmi.amu.edu.pl/, and you will see a screen like this:
Among the more than 20 jigsaw pieces on display, drag those that you want to use onto the bar at the bottom. For example, to count the words in a document:
Note that the first piece (Read File) has a protrusion on the right-hand side only, which means that another element is expected only on that side. The shape of the second piece suggests that other pieces should appear on both sides, whereas the third (Save To File) can appear only at the end of the jigsaw.
In what document do we want to do a word count? Bash Box provides four options:
If you select the file containing information on young offenders, the content of that document will be displayed on the screen:
The first row contains the information that in the year 2010 there were 741 female young offenders aged 17. Subsequent rows contain similar information for different ages and sexes.
To count the number of words in this document, we “translate” the sequence of instructions from the jigsaw into a corresponding Linux command. This is done using the button highlighted below:
A special environment, popularly called the “black window”, is available in the Linux system for issuing commands. This is the Bash shell. On translation, our sequence of instructions takes the following form in the Bash shell:
This code looks somewhat complicated, but let’s try to analyse it.
In the window at the side you can see the result of the whole pipeline or sequence of commands:
This tells us that the document on young offenders contains a total of 819 words.
Bash Box also lets you play at text processing by solving puzzles in the style of stories from westerns. To do this, use the following button at the top left of the screen:
Let’s solve the first puzzle:
The format of the document from the database mentioned in the story is displayed alongside the puzzle:
To obtain the required answer, we have to perform the following steps:
We drag suitable jigsaw pieces to the bar on the screen:
Now we check whether the answer obtained is correct:
The Check button turns orange, which means that the answer given was wrong! We therefore drag the third piece back to its original place, replacing it with another piece:
Now the Check button turns green, which means we can happily move on to the next of the eight puzzles.
If you were to ask a computer science student how to obtain a frequency list of all words appearing in the complete works of Shakespeare, for example, they would probably answer that it would be simplest to write suitable code in the Python programming language. There is a much easier way of doing it, however – by means of a single Bash shell command:
cat sh.txt | tr -sc ’A-Za-z’ ’\n’ | sort | uniq -c > sh_frequency_list
Assume that the text file sh.txt contains Shakespeare’s complete works. Then:
Processing text documents in Linux is extremely effective. With a single command you can achieve the same results as you could with complicated programs. The only difficulty to overcome is learning the available commands and understanding how to arrange them in a pipeline.
This is certainly worth doing, however – and here the Bash Box platform can be of enormous help!
In a previous post on this blog I discussed the task of classification, which involves determining automatically which class a given object belongs to. A classification system assigns to each object a class label, where the number of possible labels is defined in advance.
In regression, on the other hand, a machine learning system assigns to each object a certain numerical value, out of an infinite range of possibilities.
Let us consider the example of a set of TripAdvisor reviews for a particular restaurant. This is a selected review from the set:
For a dataset like this, we might define at least a few different classification tasks – for example:
By contrast, an example regression task for the same dataset might be the following:
Rate the quality of the service being described in the review by means of a real number in the interval from 0 to 5.
The value computed by a regression system based on a review might be, for instance, 3.45. We expect the value returned by the system to correspond adequately to the opinion expressed in the review: if the opinion is very positive, the value should be close to 5, but if it is strongly negative, the value should be close to zero.
Regression provides more precise information than classification. We can use it, for example, to track the progress a restaurant is making in improving its quality of service, by comparing the values derived from its reviews at different times. A regression system can also be used with this type of analysis to predict future values of stock market indices – and in this case accuracy, even to many decimal places, can be crucial.
The regression method can be used to compute a predicted value based on an object’s features. In machine learning, a feature of an object means a property that can be expressed numerically. A feature of a review might be, for example, the number of words contained in it which carry a positive meaning, or the number of words with negative overtones. When developing a computer system using the regression method, the system engineers themselves define the set of features to be taken into consideration.
In the method of linear regression, it is assumed that the value to be predicted can be computed using a linear function of feature values. We may assume, for example, that the value of an opinion is directly proportional to the number of words with positive meaning contained in it. This is expressed by the formula:
VALUE = α x NUMBER OF POSITIVE WORDS
where α is some positive number.
We might additionally assume that the predicted value is reduced by an amount proportional to the number of words with negative meaning:
VALUE = α x NUMBER OF POSITIVE WORDS
+ ? x NUMBER OF NEGATIVE WORDS
where ? is some negative number.
If the function defined in this way returns, for example, values ranging from –1 to 4, and we want the results to lie in the range 0 to 5, we can shift the range of returned values by adding a “displacement parameter”:
VALUE = α x NUMBER OF POSITIVE WORDS + ? x NUMBER OF NEGATIVE WORDS + ?
In this case, the linear regression task is to find values of the coefficients α, ? and ? such that the values calculated using the above formula conform to expectations.
If we use regression to evaluate objects (e.g.: What is the reviewer’s attitude to the reviewed service, on a scale of 0 to 5?), the method is said to conform to expectations if the values that it returns are largely in agreement with human intuition.
If we use a regression method for forecasting (e.g.: What will be the value of the stock market index next month? What will be the value of textbook sales in September?), the results conform to expectations if there is only a small difference between the predictions and the actual values (which can be checked after the fact).
Regression coefficients are computed from training data: a set of objects for which we know both the values of their features and the values being predicted. For example, a training data object might be the restaurant review shown earlier in this post. This contains five positive words (mega, nice, best, thank, pleasant) and one negative word (not), and the point score awarded by the reviewer (which in this case is the value we will be trying to predict) is 5, as shown by the five green dots at the top of the review. If we had to determine the regression coefficients on the basis of just this one example, they might take the following values:
α = 1
? = –1
? = 1
In this case the predicted value of the review, computed from the formula given above, would be:
VALUE = 1 x 5 + (–1) x 1 + 1 = 5
The value computed from the linear regression formula would thus be identical to the point score of the review from the training set. Bravo!
Let us assume, however, that another review from the training set – again with five points awarded – contains the following opinion:
Superb restaurant! I can’t wait to go there again!
Here the numbers of positive and negative words are 1 each (Superb and can’t). If we apply the same values of α, ? and ?, the above formula will give a predicted value of:
VALUE = 1 x 1 + (–1) x 1 + 1 = 1
Here the value computed from the linear regression formula differs from the user’s actual score by as many as 4 points! The inclusion of this second data object in the training set thus makes it necessary to change the coefficients – to the following values, for example:
α = 0.5
? = –0.5
? = 4
After this modification, the predicted score for the first object will be 6 (one point too high), while for the second object it will be 4 (one point too low). For both objects jointly, therefore, the difference between the predicted and real values is just 2, whereas with the previous choice of values of coefficients it was 4.
Regression coefficients are determined in such a way that the sum of the differences between the predicted and actual values, over the entire set of training data, is as small as possible. This sum of differences between the results returned by the linear function and the actual values is called a cost function.
In mathematical language: the goal of the regression method is to find values of the regression coefficients such that the cost function attains its minimum.
The method of gradient descent is used to find the minimum point of a function – that is, the value(s) of its argument(s) for which the function takes its lowest value. We can illustrate this using an example in which the function has only one argument.
As we can see from the above graph, the analysed function takes its lowest value (–6) for the argument value x = 2. The value of 2 is thus the minimum point of this function.
To see how gradient descent works, consider the Tale of the Empty Tank.
Imagine you have gone for a drive in the mountains. Halfway up a hill the car suddenly comes to a stop – and for the most prosaic reason you could think of: its petrol tank is empty. You will now have to go on foot to the nearest petrol station.
You aren’t sure where the station is, although you know for certain that it is at the lowest point along the road. So which way should you go? You look around: the road leads upwards in one direction, and downwards in the other. So you take a step in the downwards direction, and then repeat the procedure.
Using this procedure, can you be sure of reaching the petrol station? If the cross-section of the road looks like the left-hand diagram, you will certainly reach your goal. However, if it looks something like the right-hand diagram, success is no longer guaranteed.
Gradient descent is analogous to the method used in the Tale of the Empty Tank:
Keep moving in this way until you reach a point where the function does not decrease on either side. You have now reached your minimum!
Such an algorithm will find a minimum effectively for all functions with a graph like the one in the left-hand diagram – those that have exactly one “valley”. These are called convex functions. However, the algorithm may not work correctly in the case of functions like the one in the right-hand diagram.
Luckily, it turns out that the function we are interested in – the cost function – is convex. Therefore, the method of gradient descent works perfectly in this case.
Unfortunately, the results obtained by linear regression do not always leave the user fully satisfied. Of course, when we expect the system to provide a predicted temperature, we are pleased to get a result expressed in degrees Celsius, and when we need a stock price forecast, we are happy for the system to give us a value in dollars. However, a review score expressed as a number like 3.45 remains unclear if we do not know what scale is being used. Perhaps in that case it would be better to return a result expressed as a percentage (e.g. 69%)? And if we want to predict Nancy Pelosi’s chances of being elected US president in 2024, we would certainly expect a result expressed in percentage terms, not a number from some unknown range.
Logistic regression makes it possible to recalculate the results obtained by linear regression so that they lie in the interval from 0 to 1. To do this, we take a result given by linear regression and apply a logistic function, whose values always lie within that interval.
An example of a logistic function is the sigmoid function:
As the graph shows, if the value returned by the linear function is 0, the sigmoid function will convert it to 0.5. The sigmoid function maps negative quantities to values in the range from 0 to 0.5, and positive quantities to values from 0.5 to 1. Therefore, all values of the sigmoid function are contained in the interval from 0 to 1. Which is just what we want!
Logistic regression is one of the fundamental methods of machine learning. It is used, for example, in neural networks, which thanks to new technological developments (specialised graphics cards, high-performance tensor processing units, etc.) are taking over more and more bastions of human intelligence.
But neural networks will be the topic of another post on our blog!
A model is a representation of some entity, serving to make it easier to work with. A model is often a miniature version of an object. When playing with a miniature model of an aeroplane, a child can add or subtract various components – mount wings, for example, or remove an engine. In an atlas or on a globe, which represent the Earth, we can cover hundreds or thousands of miles with just one sweep of a finger. On the other hand, if we want to observe the motion of elementary particles, the model of the atom must be a great deal larger than the original. A model as a representation of something makes it easier for us to understand and get to know that thing, from the general concept to detailed properties.
A model may be created for use in a computer application. A digital model of the architecture of a building makes it easier to make changes to the plans – modification can be performed by a human in a computer environment. Sound is modelled in a computer using a digital representation, which a program can transform (by raising the pitch of an utterance, for example) even without human intervention.
A computer language model is a representation of a language which enables a computer to perform certain operations, such as analysing the meaning of a text, carrying out instructions expressed in the language, or correctly translating a text into another language. When we speak of a computer language model (or language model for short), we usually have in mind a representation of just the text itself, omitting additional information about the sound of speech such as pitch or intonation.
In the 20th century it was believed that a language used by humans (a natural language) could be written in the form of rules. Every language was thought to be governed by some finite set of rules indicating how to put words together to form a correct sentence. Moreover, the rules were supposed to make it possible to identify the components from which a given sentence was built, and then to determine its meaning.
In the case of Polish, for example, the following three rules make it possible to formulate simple sentences:
According to these rules, one might create the following simple sentence, for instance:
Szary kotek pije wodę. (The grey cat drinks water.)
The first component of the sentence is the “subject part”, szary kotek (grey cat), built from an adjective in the nominative, szary (grey), and a noun also in the nominative, kotek (cat). The second component is the “predicate part”, pije wodę (drinks water), composed of a verb, pije (drinks), and a noun in the accusative, wodę (water).
Knowing the basic components of a sentence, its meaning can be determined. For example, it can be assumed that the “subject part” denotes the performer of an action, the verb denotes the action, and the noun in the accusative denotes the object being acted on. For example, in the sentence above, the performer of the action is a grey cat, the action is drinking, and the thing being drunk is water.
To date, unfortunately, it has not proved possible to create a set of rules for any language satisfying the following two requirements:
The difficulty in satisfying the first condition results from the fact that people are unbelievably creative in thinking up language constructions that philosophers would not have dreamed of (to convince yourself of this, just listen to politicians). The difficulty in satisfying the second requirement can be illustrated with the following sentence: The barren bus sounds a cat. Although this sentence conforms to the basic rules of English grammar (analogous to those given above for Polish), it does not make any sense. Surely nobody has ever spoken such a sentence elsewhere, and surely it won’t be repeated. It should not, therefore, be considered a correct English sentence.
A text corpus (plural: corpora) is a set of documents collected in one place to enable analyses to be performed on the entire collection of texts. For example, a set of the complete works of Shakespeare placed in one location (in a library if they are printed on paper, or on a computer disk if they are in digital form) makes up a text corpus which could be used, for example, to analyse the famous dramatist’s writing style.
The owner of the world’s largest digital corpus is undoubtedly Google. That organisation is able to perform analyses that encompass all of the texts accessible through its search engine.
A statistical model of a language is a representation based on some corpus of texts written in that language. A statistical model of Polish, for example, may be built from the set of all Polish-language texts accessible via the Google search engine.
Models of a given language may be very different from each other if they are built using different corpora. For example, we might construct specialised language models, using a corpus of texts taken exclusively from a particular specialist field.
Scrabble players may sometimes argue about whether a particular word is allowed in the game. In Poland, the arbiter used in such disputes is often the website sjp.pl (the abbreviation stands for słownik języka polskiego, “dictionary of the Polish language”). However, the language is constantly evolving, and a dictionary cannot always keep up with the changes.
But what if a somewhat different rule were adopted, namely that a word is allowed if it appears in a particular corpus – for example, the corpus of Polish texts available in Google? A new word, popular among younger speakers, such as plażing (“beaching”), might be deemed inadmissible by sjp.pl, although Google returns over 100,000 examples of its use.
We might, therefore, define a statistical language model for Scrabble, based on Polish-language texts indexed by Google, in the following way: a word is allowed if Google returns more than 1000 occurrences in Polish texts. This threshold is defined arbitrarily, but serves to eliminate words that might be found on the Internet by accident (as a result of typing errors, for instance).
Some words may appear more often in a given corpus, and others less often. We might therefore build a statistical language model that awards higher scores to words with a higher frequency of occurrence. The correctness of a word is evaluated on a scale from 0 (absolutely incorrect) to 1, where the value is obtained by dividing the number of occurrences of that word in the corpus by the total number of words in the corpus. For example, if the word in question appears five times in a corpus consisting of 100 words, its correctness will be 0.05. The correctness of a word that is absent from the corpus, of course, will be 0.
Such a model can also be applied to whole sentences. We calculate in turn the correctness of each word in the sentence, and then multiply all the values together. If any of the words fails to appear in the corpus, the sentence as a whole will be assigned the value 0; otherwise it will be assigned some non-zero value less than 1. A model of this type is called a unigram language model, since each word is considered independently of the others.
We may also try defining a language model as a function which assigns a correctness value between zero and one to any expression in the language (with respect to a given corpus).
For example, we might apply the unigram model described above, based on the Google corpus, to the previously cited sentence The barren bus sounds a cat. This would lead to a certain number larger than zero, since each of the words of this sentence will be confirmed by the corpus. However, this result does not agree with our intuition, since this particular sequence of words appears not to make any sense.
An n-gram language model, on the other hand, takes account of the occurrence of words in particular contexts. For example, a bigram model checks, for a given pair of words, how frequently the second word occurs immediately after the first. Google returns several million occurrences of the word barren, but only a few hundred in which it is followed by the word “bus”. This means that the correctness of the word pair barren bus is assessed as close to zero.
A bigram language model determines the correctness of all pairs of words appearing in a sentence (barren bus, bus sounds, etc.), and then multiplies the values together. If the sentence makes no sense, the resulting final value will be very close to zero.
In a similar way, one can construct 3-gram (trigram) or 4-gram language models. Models based on sequences of more than four words are not normally used.
The best-known application of language models is predictive text, used when typing text messages on a phone. After you type the first word of your message, the system suggests the next one. After each word, it comes up with a new set of suggestions.
How does this work? The system suggests words which, in combination with the preceding ones, give the highest correctness value (that is, they occur most frequently in the corpus). So if you begin by typing the word Good, the system might suggest that, according to the corpus it is using, the most “correct” continuations would be Good luck, Good morning or Good afternoon.
In the case of machine translation, the system first generates several potential ways of translating a sentence – differing in, for example, the choice between words of similar meaning or the word order. The language model then suggests which of the possibilities is most correct – and that text is supplied by the system as the translated sentence.
Language models perform a similar function in speech recognition. It is the model that decides that, for example, a dictated sentence is to be represented as “Let’s meet at four”, and not, for example, “Let’s meat at four” or “Let’s meet at fore”, even though all three of these versions sound the same.
A statistical language model is a method of representing a natural language based on a defined text corpus. Such a model awards higher scores to texts consisting of words and phrases that appear in the corpus more frequently.
A limitation of statistical language models, however, is that they take account only of associations between adjacent words. Let us look at the following example sentence:
The above contract contains errors of substance, and therefore it cannot be signed.
In this sentence, the words contract and signed occur at a long distance from each other (with as many as nine intervening words). A statistical language model is not able to detect that there is any semantic relation between these words, even though they refer to the same action: the signing of a contract.
Distant associations of this type are nonetheless detectable using neural network language models. Although the basic role of such models is similar to that of statistical models – they determine the degree of correctness of a text with respect to a particular corpus – the method of computing the correctness is different. The sophisticated structure of neural networks means that they are also able to take account of relationships between distant words.