BLEU Score (Bilingual Evaluation Understudy)
Introduced in 2002, BLEU (Bilingual Evaluation Understudy) revolutionised Machine Translation and became its go-to evaluation metric. Before BLEU, evaluation depended on expert human evaluators, a laborious, time-consuming, and costly process, and that cost is what motivated an automatic metric. As an NLP engineer, however, you might wonder how reliable BLEU really is and how much weight its scores deserve. What counts as a good BLEU score for your experiment? If you’ve pondered these questions before, this blog aims to help you build a solid foundation.
So, let’s first dive into the mathematical part of BLEU Score.
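BLEU score is defined as
$$\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
Here, BP is the Brevity Penalty, which is defined in terms of the candidate length $c$ and the reference length $r$ as
$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
And $p_n$ is the modified precision score, which is defined as
$$p_n = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \text{Count}_{\text{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \text{Count}(\text{n-gram}')}$$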
Pretty intense. Let’s unravel these one by one, in reverse order.
Modified Precision Score
Count clip (n-gram): the number of times an n-gram from the candidate translation also appears in the reference, capped at the maximum number of times that n-gram occurs in any single reference. This cap prevents a system from gaming the evaluation by repeating the same word many times.
Let’s look at it with an example
- Reference: The quick brown fox
- Candidate Translation: Quick quick quick quick
```
Without clipping, the precision for 1-gram would be 4/4, resulting in a
perfect score of 1. However, after clipping, we can only count 'quick' once,
as it appears only once in the reference. This leads to a precision of 1/4,
which equals 0.25. This adjustment seems appropriate.
```
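The clipping rule can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names `ngrams` and `modified_precision` are made up for this post, and tokens are assumed to be lowercased and split on whitespace.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n, clip=True):
    """n-gram precision of one candidate against one or more references."""
    cand_counts = ngrams(candidate, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    if not clip:
        # Unclipped: every candidate n-gram found in any reference counts in full.
        matched = sum(count for gram, count in cand_counts.items()
                      if any(gram in ngrams(ref, n) for ref in references))
        return matched / total
    # Clipped: cap each n-gram's count at its maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / total


reference = "The quick brown fox".lower().split()
candidate = "Quick quick quick quick".lower().split()

unclipped = modified_precision(candidate, [reference], n=1, clip=False)
clipped = modified_precision(candidate, [reference], n=1)
print(unclipped, clipped)  # 1.0 0.25
```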
Count (n-gram): the number of times an n-gram occurs in the candidate translation, with no clipping.
∑ n-gram ∈ C: the sum runs over every n-gram in a candidate sentence C.
∑ C ∈ {Candidates}: the sum runs over every candidate sentence in the dataset, so matches and totals are accumulated across the whole corpus before dividing.
In essence, if we choose to evaluate up to 5-grams, we compute one modified precision per n-gram order (1-grams through 5-grams), each aggregated over every sentence in the dataset. As a result, we would end up with 5 precision scores.
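One detail of the two sums described above is easy to miss: clipped matches and n-gram totals are added up over all sentences first, and only then divided. The sketch below illustrates this under some assumptions: whitespace tokenisation, one reference per sentence, and a toy two-sentence corpus. The function name `corpus_modified_precision` is hypothetical, and the example stops at 4-grams because the toy sentences are only four tokens long.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_modified_precision(candidates, references, n):
    """p_n over a corpus: sum clipped matches and totals across sentences, then divide."""
    matched, total = 0, 0
    for cand, ref in zip(candidates, references):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matched += sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total += sum(cand_counts.values())
    return matched / total if total else 0.0


candidates = ["the quick brown fox".split(), "quick quick quick quick".split()]
references = ["the quick brown fox".split(), "the quick brown fox".split()]

# One p_n per n-gram order, each computed over the entire corpus.
precisions = [corpus_modified_precision(candidates, references, n) for n in range(1, 5)]
print(precisions)  # [0.625, 0.5, 0.5, 0.5]
```

Averaging the per-sentence unigram precisions (1.0 and 0.25) would also give 0.625 here only because both sentences have the same length; with unequal lengths the two approaches diverge.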
Weighted Logarithm
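The authors combine these modified precision scores by averaging their logarithms with uniform weights ($w_n = 1/N$) and exponentiating the result, which is the formula’s $\exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$ term. This is equivalent to taking the geometric mean of the precisions.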
An inevitable question arises: how far should we extend this evaluation? To 5-grams, 6-grams, or beyond? As the maximum n-gram order increases, the BLEU score tends to decrease, because longer n-grams match less often. Very long n-grams don’t make the evaluation more accurate either: the same sentence can be phrased in many valid ways, and long n-grams penalise legitimate rephrasings. The authors, after careful consideration, settled on 4-grams, and that seems to be an appropriate choice.
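To see how the uniformly weighted log-average behaves as higher orders are added, here is a small sketch. The precision values are made up purely for illustration, and `geometric_mean` is a hypothetical helper name, not part of any library.

```python
import math


def geometric_mean(precisions):
    """exp of the uniformly weighted mean of log p_n (w_n = 1/N)."""
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; a zero precision collapses the score
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


# Made-up modified precisions for 1-grams to 4-grams, for illustration only.
p = [0.8, 0.6, 0.4, 0.2]

# Score (without brevity penalty) when evaluating up to max_n-grams.
scores = [geometric_mean(p[:max_n]) for max_n in range(1, len(p) + 1)]
for max_n, score in enumerate(scores, start=1):
    print(max_n, round(score, 4))
```

Because higher-order precisions are usually lower, each added order pulls the combined score down, which matches the trend described above.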
Brevity Penalty
Now, let’s delve into the final component of the formula: Brevity Penalty. The brevity penalty is introduced to prevent a translation system from earning high precision simply by producing very short translations.
Let’s look at it with an example
- Reference: The quick brown fox
- Candidate Translation: Quick
```
Without the Brevity Penalty, this one-word candidate would have a 1-gram
BLEU score of 1, but as we can observe, that wouldn't be accurate. With the
Brevity Penalty, exp(1 - 4/1) = exp(-3), the score is adjusted to about 0.05.
```
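The penalty itself is a one-line formula. Below is a minimal sketch, assuming lengths are measured in tokens; `brevity_penalty` is a hypothetical helper name, and the lengths 15 and 24 are taken from the `evaluate` example later in this post.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if c > r, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0  # an empty candidate gets no credit
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


print(brevity_penalty(4, 4))    # equal lengths: no penalty
print(brevity_penalty(1, 4))    # a one-token candidate against a four-token reference
print(brevity_penalty(15, 24))  # lengths from the evaluate example
```

A candidate that is longer than the reference is not penalised here; overly long output is already punished through lower precision.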
That concludes the mathematical explanation behind the BLEU score. Now, let’s delve into other potential questions that may arise.
Why only use precision and not recall or F1 score?
The reasoning lies in the fact that a single sentence can be translated in various ways. Recall would measure how much of the reference is reproduced by the candidate, but when several references use different word choices, a candidate that covers more of them is not necessarily a better translation. Instead, the authors compensate in two ways: the brevity penalty discourages overly short output, and multiple references, covering a range of writing styles, provide a more comprehensive evaluation of the system.
Another question that may pop up is: how do we know that BLEU works?
The BLEU score was used to assess three systems and two non-native translators across 127 source sentences. The following are the BLEU scores for these five systems. Human experts then ranked the same five systems, and their ranking matched the order given by BLEU.
One may ask, what is an ideal BLEU score?
Determining an absolute threshold for a ‘good’ BLEU score is challenging, because the score depends heavily on the dataset and on the number of references. For instance, when researchers evaluated human translators, they scored roughly 0.35 BLEU against four references and roughly 0.25 against two. So absolute numbers are only a loose guide; what BLEU captures well is the relative ranking that human judges would give. If BLEU ranks system 1 higher than system 2, we can consider system 1 superior. This is what makes BLEU useful for comparing experimental variations. The paper also demonstrates a correlation between BLEU score and human judgment.
In the paper’s correlation plots, monolingual and bilingual refer to the two groups of human evaluators.
So, what are the drawbacks of BLEU score?
When human evaluators assess machine translation systems, their focus lies on aspects like adequacy, fidelity, and fluency of translation. Can BLEU capture all these nuances? Clearly not. Let’s briefly discuss these drawbacks so that we can segue into exploring other evaluation techniques.
- BLEU is a syntactical, surface-level evaluation system: language goes beyond word overlap; it also carries meaning, and BLEU does not capture this. A sentence with no meaningful content that uses the right words in a jumbled order can still receive a high BLEU rating.
- Reference: The quick brown fox jumped over the lazy dog.
- Translation1: The quick brown fox over the lazy dog. ## BLEU Score = 0.53
- Translation2: The brown fox jumped over lazy dog. ## BLEU Score = 0.33
```
Clearly T2 is better than T1, but the BLEU score shows otherwise.
```
- BLEU does not consider fluency
- Reference: Road from Marks office to home gets very busy late evenings
thus he left early from office to reach his family on time.
- Translation1: During evenings the road from office to home gets busy
thus Mark left early. ## BLEU Score = 0.14
- Translation2: The Road from Marks to office home gets busy evenings and
he left early. ## BLEU Score = 0.85
```
Clearly T1 is much more fluent, but BLEU scores T2 higher.
```
- Another issue with BLEU is its underspecified nature. For instance, there’s no concrete decision provided on the choice of n-gram count or the count of references. As demonstrated earlier, altering these parameters can significantly impact the BLEU score, allowing researchers to potentially manipulate it to their advantage. This concern was addressed in another paper from 2018, which proposed a more robust scoring method. We will delve into this in future write-ups.
- The pre-processing of data plays a significant role in BLEU scores. For instance, lowercasing, lemmatisation, and stemming can notably inflate them. Tokenisation choices, and tricks such as replacing unknown words with ‘<unk>’, can also shift BLEU scores substantially.
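The effect of pre-processing is easy to reproduce with a toy case. The sketch below compares clipped unigram precision with and without lowercasing; `unigram_precision` is a hypothetical helper, the sentences are made up, and whitespace tokenisation is assumed.

```python
from collections import Counter


def unigram_precision(candidate_tokens, reference_tokens):
    """Clipped unigram precision against a single reference."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    matched = sum(min(count, ref[token]) for token, count in cand.items())
    return matched / sum(cand.values())


reference = "The quick brown fox"
candidate = "the Quick brown fox"

# Case-sensitive matching: only "brown" and "fox" match.
cased = unigram_precision(candidate.split(), reference.split())
# Lowercasing both sides first: all four tokens match.
lowercased = unigram_precision(candidate.lower().split(), reference.lower().split())
print(cased, lowercased)  # 0.5 1.0
```

The translation is identical in both runs; only the pre-processing changed, which is why scores are hard to compare unless the pipeline is reported alongside them.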
Let’s also look at the code to calculate the BLEU score in Python. For this, we will use the evaluate library from Hugging Face.
import evaluate
# Initialize the BLEU metric
bleu = evaluate.load("bleu")
# One reference and one candidate translation (the fluency example above)
reference = ["Road from Marks office to home gets very busy late evenings thus he left early from office to reach his family on time."]
translation = ["During evenings the road from office to home gets busy thus Mark left early."]
# Returns a dict with the score and its components, shown below
bleu.compute(predictions=translation, references=reference)
# {
# 'bleu': 0.14620368657448796,
# 'precisions': [0.7333333333333333, 0.35714285714285715, 0.23076923076923078, 0.08333333333333333],
# 'brevity_penalty': 0.5488116360940264,
# 'length_ratio': 0.625,
# 'translation_length': 15,
# 'reference_length': 24
# }
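As a sanity check, the reported components can be recombined by hand using the BLEU formula. This sketch only uses the numbers copied from the output above and assumes uniform weights over the four n-gram orders; it does not call the library.

```python
import math

# Values copied from the output above.
precisions = [0.7333333333333333, 0.35714285714285715,
              0.23076923076923078, 0.08333333333333333]
translation_length = 15
reference_length = 24

# Brevity penalty: exp(1 - r/c), since the candidate is shorter than the reference.
bp = math.exp(1 - reference_length / translation_length)

# Uniformly weighted geometric mean of the four modified precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu_score = bp * geo_mean
print(round(bp, 4), round(bleu_score, 4))  # 0.5488 0.1462
```

The result agrees with the `'bleu'` and `'brevity_penalty'` fields in the output, which shows that the final score is just the brevity penalty times the geometric mean of the precisions.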
Conclusion
No doubt, the BLEU score remains a valid metric for evaluating translation systems. It is quick, simple to interpret, and correlates strongly with human evaluations, which makes it useful for comparing results within an experiment. However, caution should be exercised when comparing scores across different papers, since differences in pre-processing and parameters change the numbers, and a high BLEU score does not guarantee semantically correct sentences. These limitations have been addressed in subsequent papers, which we will explore in future discussions.