How to evaluate the quality of automatic translation

July 20, 2020

How can we determine whether an automatic translation system is doing its job – that is, translating texts correctly and preserving the original meaning? How should we compare the quality of two translation systems so as to choose the one that best meets our needs? I will try to answer these questions in this blog post.

Human evaluation

A translation may be evaluated by a human. Here, some predefined quality scale is used – usually a five-point scale, where a score of 5 denotes the highest quality. The translation of each sentence is scored separately, and finally the arithmetic mean of the scores for all sentences is calculated. Often, two components of quality are distinguished: the faithfulness of the translation to the original, and the correctness and fluency of the translated text.
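The averaging step can be illustrated with a minimal sketch. The scores below are invented purely for illustration and the function name average_score is hypothetical; the sketch only shows how per-sentence scores for the two quality components are reduced to one number each.

```python
from statistics import mean

# Hypothetical scores (1-5) given by one evaluator to five translated sentences.
# These numbers are made up for illustration; they are not from any real evaluation.
faithfulness_scores = [5, 4, 4, 3, 5]
fluency_scores = [4, 4, 3, 3, 5]


def average_score(scores):
    """Return the arithmetic mean of the per-sentence scores."""
    return mean(scores)


print(f"faithfulness: {average_score(faithfulness_scores):.2f}")
print(f"fluency:      {average_score(fluency_scores):.2f}")
```

Reporting the two components separately, rather than as one combined figure, keeps it visible whether a system loses points on meaning or on the quality of the target-language text.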

Automatic evaluation using WER

Human evaluation is nonetheless an expensive and time-consuming task, and moreover is subjective in nature. A much cheaper solution, and one that is independent of human moods and biases, is automatic evaluation. In this case the translation is compared with a “gold standard” – an ideal translation produced by specialists. In the 20th century a popular metric used for this kind of evaluation was the Word Error Rate (WER). This is computed based on the minimum number of changes – insertion, deletion or substitution of a word – that would need to be made to the sentence proposed by the system in order to obtain the “gold standard” version. This number is then divided by the number of words in the “gold standard” translation.

Let’s see how this method works by looking at a concrete example:

Sentence to be translated: Prawo zaskarżania nie przysługuje byłym członkom zarządu spółki.

“Gold standard” translation: The right to appeal shall not be granted to former members of the management board.

Translation proposed by the system: The right of appeal is not available to former members of the management board.

To go from the machine translation to the gold standard, we need to make three substitutions of words (of → to, is → shall, available → granted) and to insert one additional word (be). The length of the gold standard version is 15 words. Hence the WER for this proposed translation is 4/15. Clearly, the higher the value of the WER, the lower the quality of the translation.
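The calculation above can be reproduced with a short sketch. It assumes the simplest possible tokenisation (splitting on whitespace, so that "board." counts as one word) and uses the standard dynamic-programming edit distance over words; the function name word_error_rate is chosen here for illustration only.

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance from hypothesis to reference, divided by reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("the reference translation must not be empty")

    # dist[i][j] = minimum number of edits turning the first i hypothesis
    # words into the first j reference words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i  # delete all i hypothesis words
    for j in range(len(ref) + 1):
        dist[0][j] = j  # insert all j reference words

    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            substitution_cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # match or substitution
            )
    return dist[len(hyp)][len(ref)] / len(ref)


gold = "The right to appeal shall not be granted to former members of the management board."
system = "The right of appeal is not available to former members of the management board."

print(f"WER = {word_error_rate(system, gold):.4f}")
```

For the two sentences above the minimum number of edits is 4 and the gold standard has 15 words, so the function returns 4/15.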

Automatic evaluation using BLEU

Today, the metric most commonly used in automatic evaluation is BLEU (Bilingual Evaluation Understudy), proposed by researchers at IBM in 2002. Unlike WER, a higher BLEU value means a better translation: the metric indicates what proportion of the machine translation, counted in sequences of consecutive words, corresponds to the gold standard. For instance, in the example given above, the longer fragments that correspond are The right and to former members of the management board; apart from the isolated words appeal and not, the remaining elements of the translation deviate from the standard. The value of the BLEU metric always lies in the interval from 0 to 1, and is often stated as a percentage.
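To make the idea of "corresponding fragments" more concrete, here is a simplified sketch of a BLEU-style score for a single sentence. It is an illustration under stated assumptions, not the reference implementation: it splits on whitespace, is case-sensitive, uses one reference translation, applies no smoothing, and works on one sentence, whereas BLEU is normally computed over a whole test set with standardised tokenisation. The function names are chosen here for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hyp, ref, n):
    """Count hypothesis n-grams that also occur in the reference (clipped), and the total."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


def simple_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Geometric mean of 1..max_n-gram precisions times a brevity penalty (no smoothing)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matched, total = modified_precision(hyp, ref, n)
        if matched == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_sum += math.log(matched / total)
    # Penalise hypotheses that are shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(log_sum / max_n)


gold = "The right to appeal shall not be granted to former members of the management board."
system = "The right of appeal is not available to former members of the management board."

for n in range(1, 5):
    matched, total = modified_precision(system.split(), gold.split(), n)
    print(f"{n}-grams matched: {matched}/{total}")
print(f"simplified BLEU = {simple_bleu(system, gold):.4f}")
```

For the example sentence, 11 of 14 single words, 7 of 13 word pairs, 5 of 12 triples and 4 of 11 four-word sequences of the machine translation also appear in the gold standard. The long shared fragment to former members of the management board accounts for most of the longer matches.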

Translation quality achieved by the world’s leading systems

The tables below contain the results of competitions in translating press reports, held at the Conference on Machine Translation (WMT) in 2017 and 2018. Comparing the two tables shows that the scores of the leading systems rose markedly within a single year.

system	BLEU
uedin-nmt	37.00
KIT	36.48
RWTH-nmt-ensemble	35.09
online-A	34.97
SYSTRAN	34.88
online-B	34.37
LIUM-NMT	31.75
C-3MA	30.64
online-G	30.09
TALP-UPC	29.95
online-F	19.49

Table 1: Competition results, WMT 2017

system	BLEU
RWTH	50.17
UCAM	49.88
NTT	48.71
JHU	47.57
MLLP-UPV	47.51
uedin	45.87
Ubiqus-NMT	45.57
online-B	45.47
online-A	43.34
LMU-nmt	43.17
online-Y	41.69
NJUNMT-private	39.72
online-G	36.39
online-F	23.86
RWTH-UNSUPER	20.35
LMU-unsup	19.12

Table 2: Competition results, WMT 2018

Translating Polish

In 2018, a group of researchers at Adam Mickiewicz University in Poznań, in collaboration with the company POLENG, carried out two experiments to evaluate the quality of translations of texts in a particular field, from and into Polish.

Specialist translation – from a broad field

In the first experiment the subject field was defined in broad terms, and the number of training texts supplied by the client was relatively small. POLENG engineers independently collected a sufficient number of texts to enable the system to be trained.

The final training set consisted of:

60,000 sentence pairs supplied by the client;
7.2 million sentence pairs collected by POLENG engineers.

The system was trained for translation from Polish to English and from English to Polish. The results of the experiments, in the form of values of the BLEU metric expressed as percentages, were as follows:

Polish–English translation	English–Polish translation
35.80	39.90

Table 3: Automatic evaluation of specialist translation from and into Polish

The translation results were also subjected to human evaluation, where approximately 500 sentences were assessed on a scale of 1 to 5, considering two aspects of quality: the faithfulness of the translation and its correctness. The following results were obtained:

aspect	Polish–English translation	English–Polish translation
faithfulness	4.23	3.90
correctness	3.94	3.74

Table 4: Human evaluation of specialist translation from and into Polish

It is interesting to note that according to the automatic BLEU metric the translations from English to Polish were assessed as being of better quality, while in human evaluation the translations from Polish to English scored higher. This may be because the evaluators were Polish, and took a more critical view of translations written in their native language.

Highly specialised translation – from a narrow field

The second experiment used a training set containing 1.2 million sentences, all supplied by one client. This time a comparison was made between two translation systems: one statistical, and one based on neural networks. Only translation from English to Polish was tested. The same evaluation was also carried out for the Google Translate system, which is intended to handle general texts. The aim was to determine which of the translation methods produces better results when only a relatively small set of training texts is available.

The following results were obtained:

system	BLEU percentage
statistical	55.23
neural network	51.66
Google Translate	21.37

Table 5: Comparison of quality of specialist translation into Polish

Both systems that had been trained on specialist texts achieved BLEU scores more than twice as high as that of the system designed for general translation. The results obtained with this small, narrowly specialised corpus were also better than those from the previous experiment, where the system was trained using a larger training set of texts from a more broadly defined field.
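The "more than twice as high" comparison can be checked directly from the figures in Table 5. The sketch below simply divides each specialist system's BLEU score by the Google Translate score; the variable names are chosen for illustration.

```python
# BLEU percentages taken from Table 5.
bleu = {
    "statistical": 55.23,
    "neural network": 51.66,
    "Google Translate": 21.37,
}

baseline = bleu["Google Translate"]

# Ratio of each specialist system's score to the general-purpose system's score.
ratios = {
    system: score / baseline
    for system, score in bleu.items()
    if system != "Google Translate"
}

for system, ratio in ratios.items():
    print(f"{system}: {ratio:.2f} times the Google Translate score")
```

Both ratios come out above 2, at roughly 2.6 for the statistical system and 2.4 for the neural network system.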

Surprisingly, the statistical system returned better results than the neural network system. Given this fact, it was decided to carry out an additional human evaluation. Here, two independent verifiers compared the translations supplied by the two systems, without being told which came from which type of system. For each of 4,000 translation pairs, the verifier indicated which of the translations was superior, or declared a tie. The results were as follows:

winner	number of sentences	as percentage
statistical translation	829	20.73%
neural network translation	1248	31.20%
tie	1923	48.08%

Table 6: Human comparison of the quality of statistical and neural network translation systems
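The percentages in Table 6 follow directly from the counts. As a small sketch, the code below recomputes them and, as an additional view that is not part of the original study, also shows how the wins are split once the ties are set aside. The variable names are chosen for illustration.

```python
# Counts taken from Table 6.
counts = {
    "statistical translation": 829,
    "neural network translation": 1248,
    "tie": 1923,
}

total = sum(counts.values())

# Share of all compared pairs, as in the table.
share_of_all = {label: 100 * n / total for label, n in counts.items()}

# Share among decided pairs only (ties excluded); an extra view for illustration.
decided = total - counts["tie"]
share_of_decided = {
    label: 100 * n / decided for label, n in counts.items() if label != "tie"
}

print(f"pairs compared: {total}")
for label, pct in share_of_all.items():
    print(f"{label}: {pct:.2f}% of all pairs")
for label, pct in share_of_decided.items():
    print(f"{label}: {pct:.2f}% of decided pairs")
```

Among the pairs where the verifiers expressed a preference, the neural network translation was chosen in roughly 60% of cases.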

Which is better – statistical or neural network translation?

When scored by humans, the neural network method clearly outperformed the statistical method. This implies that the neural network produces better results according to human evaluation than would be indicated by the automatic BLEU metric. This fact, which was known previously, is explained by the specific construction of the BLEU metric, which favours “locally correct” translations. Neural network translation is oriented more towards analysing the relationships between words that are more distant from each other.

What comes next?

The quality of machine translation is constantly improving. It can therefore be expected to win an increasing share of the market. Automatic translation will be used primarily for technical and specialist texts, while humans will remain indispensable for the translation of general texts or those of mixed type. In the case of specialist translation, humans will work mainly on the post-editing of texts proposed by computers.

Neural network translation will remain the dominant technology, at least for the next few years, and constant progress will be achieved through the continued development of neural network architecture.

Biuro Tłumaczeń
POLENG Sp. z o.o.
