In automatic text classification, a computer system has the job of assigning texts to defined categories. In some applications the system itself decides how to define the categories (classes) of texts. For example, if someone wants to classify the latest news reports, but is not certain what topics they will relate to, they may tell the artificial intelligence (AI) system to divide the set of reports into a specified number of classes (10, say). The system will then group the reports in such a way that each category contains texts that use similar vocabulary.
Under the alternative scenario, it is the human who decides how to define the classes to which the texts are to be assigned. For example, in an e-mail classification task (see also: /en/blog/nlp-what-is-it-and-what-can-it-be-used-for) we may ask the AI system to classify each e-mail message according to its usefulness for the addressee, or to place it appropriately into the user’s spam folder (the SPAM category) or their inbox (the NON-SPAM category).
Similarly, in sentiment analysis (see: /en/blog/nlp-what-is-it-and-what-can-it-be-used-for) the classifier has to analyse reviews and place them into three predefined categories according to the writer’s attitude to the subject: negative, neutral or positive.
The basic technique for dividing texts into categories that have been predefined by a human is Bayes’ method. This is the classification method that I will be describing in this blog post.
Bayes’ method is based on supervised learning. This means that the AI system learns to perform a certain task by analysing a set of data for which the task has already been performed. In the case of spam recognition, the system learns the task with the help of e-mail messages that have already been classified appropriately by a human being.
Preparing data for the purpose of machine learning is a task that is quite time-consuming and intellectually uninteresting. It requires many repetitions of similar actions, and the quality with which they are performed is hard to verify reliably.
An effective way of overcoming such problems has been developed by Amazon, in the form of a model of working called the Amazon Mechanical Turk. This is used for tasks that require human intelligence, but not necessarily expertise in a particular field. In this model, tasks are assigned to a suitably large group of workers (called – in accordance with “the best principles of political correctness” – Mechanical Turks) in such small portions that each of them can perform their part at a time of day that suits them, with the ability to take a break from the task whenever necessary (for example, when getting off the bus or tram). Moreover, exactly the same portions of work are assigned to different workers who have no contact with each other, thus making it possible to verify the answers obtained and to prioritise those that are given consistently by multiple different assessors.
In the case of an e-mail classification task, the smallest portion of work to be done by the Mechanical Turk will be to assign a single e-mail message to one of two classes: SPAM or NON-SPAM. The collective results of the work of a large group of Mechanical Turks can then serve as a large training corpus to be used to prepare an AI system – in this case, one that is to perform the task of filtering out incoming spam e-mails.
A problem with AI, however, is that it can never be trusted with 100% certainty. In the case of a classification task, an AI system will only be able to determine that a given object probably belongs to a particular category. If the user requires the system to indicate a specific class to which an object is to be assigned (which is usually the case in practice), the system will choose the class for which the computed probability of the object’s belonging to that class takes the greatest value.
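The decision step described above amounts to picking the class with the highest computed probability. A minimal sketch in Python (the probability values here are purely illustrative, not the output of a real classifier):

```python
# Probabilities computed by a classifier for each class
# (illustrative values only).
probabilities = {"SPAM": 0.0003, "NON-SPAM": 0.0011}

# The system reports the class whose probability is greatest.
predicted = max(probabilities, key=probabilities.get)
print(predicted)  # NON-SPAM
```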
The Bayes classifier determines the probability of an object’s belonging to a particular class by combining two factors: the prior probability of the class and the likelihood – the probability of observing the object’s characteristics given that class. The product of the two is proportional to the posterior probability on which the decision is based.
The first of these values is determined before starting to analyse the object. For example, the prior probability that a given e-mail is spam is determined purely from data on the percentage of e-mails in the training set that were placed in the SPAM category. If, for example, the training set consisted of 100 e-mails, and 30 of them were classified as spam, then the prior probability that a given e-mail is spam will be 30%.
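Using the numbers from this example, the prior can be computed directly from the class counts in the training set:

```python
# Prior probabilities estimated from class counts in the training set
# (the 100-message / 30-spam split is the example from the text).
total_emails = 100
spam_emails = 30

prior_spam = spam_emails / total_emails
prior_non_spam = (total_emails - spam_emails) / total_emails

print(prior_spam)      # 0.3
print(prior_non_spam)  # 0.7
```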
The likelihood is computed after analysing the object itself: its relevant characteristics are identified, and it is then established how typical those characteristics are of each class.
Imagine a training set containing ten subject lines of e-mails that were classified as spam. On this basis, the system has to classify a new e-mail with the following subject line:
Bargain! Subscription for $20 a month. Check it out!
One characteristic of this message is the use of exclamation marks. This turns out to be a typical feature of spam, as it occurs in as many as five of the ten subject lines in the training set that were placed in the SPAM category.
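Since the original list of spam subject lines is not reproduced here, the data below is an invented stand-in; the sketch only shows how such a per-class feature frequency would be estimated:

```python
# Hypothetical spam subject lines (invented stand-ins for the
# training list described in the text).
spam_subjects = [
    "Bargain! Act now",
    "You have won!",
    "Cheap subscription",
    "Limited offer!",
    "Free gift inside",
    "Click here!",
    "Best price ever",
    "Win big today!",
    "Exclusive deal",
    "No fees, no catch",
]

# Fraction of spam subject lines that contain an exclamation mark.
with_bang = sum("!" in s for s in spam_subjects)
print(with_bang / len(spam_subjects))  # 0.5
```

With this stand-in data, the exclamation mark appears in five of the ten spam subjects, matching the 5-in-10 frequency discussed above.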
In practice, the Bayes classifier is used in such a way that for each characteristic of a given object, the system computes to what degree that characteristic is typical of a given class. In the case of text classification, for each word or punctuation mark in a given document, it is determined with what frequency it appears in the documents assigned to a particular category. The values obtained for all such characteristics are then multiplied together to obtain the likelihood of the document under that class. (Multiplying the individual values in this way assumes that the characteristics are independent of one another – the simplifying assumption that gives the “naive” Bayes classifier its name.)
The system places the object in the class for which the product of the two factors – the prior probability and the likelihood – takes the greatest value.
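The whole procedure – per-class word frequencies, multiplied together and weighted by the prior – can be sketched as a toy naive Bayes classifier. The tiny training corpus below is invented for illustration, and add-one (Laplace) smoothing is used so that a word unseen in one class does not zero out the entire product:

```python
from collections import Counter

# Toy training corpus (invented for illustration).
train = [
    ("bargain subscription check it out", "SPAM"),
    ("win a free prize now", "SPAM"),
    ("meeting agenda for monday", "NON-SPAM"),
    ("project status report attached", "NON-SPAM"),
]

# Class frequencies and per-class word frequencies from the training set.
class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def classify(text):
    scores = {}
    for label in class_counts:
        # Prior: share of training documents belonging to this class.
        score = class_counts[label] / len(train)
        total = sum(word_counts[label].values())
        for word in text.split():
            # Likelihood of each word given the class,
            # with add-one (Laplace) smoothing.
            score *= (word_counts[label][word] + 1) / (total + len(vocab))
        scores[label] = score
    # Choose the class with the greatest prior-times-likelihood product.
    return max(scores, key=scores.get)

print(classify("free subscription bargain"))  # SPAM
```

Real implementations work with the logarithms of these probabilities rather than the raw products, since multiplying many small numbers quickly underflows floating-point precision.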
Simple? The method is certainly not exceptionally complicated, but at the same time it is extremely effective. In fact, the effectiveness of the Bayes classifier is often taken as a benchmark for other more sophisticated solutions. It turns out to be not at all easy to beat!