Traditional Versus Neural Metrics for Machine Translation Evaluation


100+ new metrics since 2010


An evaluation with automatic metrics has the benefit of being faster, more reproducible, and cheaper than an evaluation conducted by humans.

This is especially true for the evaluation of machine translation. A human evaluation would ideally require expert translators.

For many language pairs, such experts are extremely rare and difficult to hire.

A large-scale and fast manual evaluation, as required by the very dynamic research area of machine translation to judge the latest systems, is usually impractical.

Consequently, automatic evaluation for machine translation has been a very active, and productive, research area for more than 20 years.

While BLEU remains by far the most used evaluation metric, there are countless better alternatives.

Since 2010, 100+ automatic metrics have been proposed to improve machine translation evaluation.

In this article, I present the most popular metrics that are used as alternatives, or in addition, to BLEU. I grouped them into two categories: traditional and neural metrics, each category having different advantages.

Most automatic metrics for machine translation only require:

  • The translation hypothesis generated by the machine translation system to evaluate
  • At least one reference translation produced by humans
  • (Rarely) the source text translated by the machine translation system

Here is an example of a French-to-English translation:

  • Source text:

Le chat dort dans la cuisine donc tu devrais cuisiner ailleurs.

  • Translation hypothesis (generated by machine translation):

The cat sleeps in the kitchen so cook somewhere else.

  • Reference translation (produced by a human):

The cat is sleeping in the kitchen, so you should cook somewhere else.

The translation hypothesis and the reference translation are both translations of the same source text.

The objective of an automatic metric is to yield a score that can be interpreted as a distance between the translation hypothesis and the reference translation. The smaller the distance, the closer the system is to generating a translation of human quality.

The absolute score returned by a metric is usually not interpretable alone. It is almost always used to rank machine translation systems: a system with a better score is a better system.

In one of my studies (Marie et al., 2021), I showed that nearly 99% of research papers in machine translation rely on the automatic metric BLEU to evaluate translation quality and rank systems, while more than 100 other metrics have been proposed during the last 12 years. Note: I only looked at research papers published by the ACL since 2010. Potentially many more metrics have been proposed to evaluate machine translation.

Here is a non-exhaustive list of 106 metrics proposed from 2010 to 2020 (click on a metric name to get the source):

Noun-phrase chunking, SemPOS refinement, mNCD, RIBES, extended METEOR, Badger 2.0, ATEC 2.1, DCU-LFG, LRKB4, LRHB4, I-letter-BLEU, I-letter-recall, SVM-RANK, TERp, IQmt-DR, BEwT-E, Bkars, SEPIA, MEANT, AM-FM, AMBER, F15, MTeRater, MP4IBM1, ParseConf, ROSE, TINE, TESLA-CELAB, PORT, lexical cohesion, pFSM, pPDA, HyTER, SAGAN-STS, SIMPBLEU, SPEDE, TerrorCAT, BLOCKERRCATS, XENERRCATS, PosF, TESLA, LEPOR, ACTa, DEPREF, UMEANT, LogRefSS, discourse-based, XMEANT, BEER, SKL, AL-BLEU, LBLEU, APAC, RED-*, DiscoTK-*, ELEXR, LAYERED, Parmesan, tBLEU, UPC-IPA, UPC-STOUT, VERTa-*, pairwise neural, neural representation-based, ReVal, BS, LeBLEU, chrF, DPMF, Dreem, Ratatouille, UoW-LSTM, UPF-Colbat, USAAR-ZWICKEL, CharacTER, DepCheck, MPEDA, DTED, meaning features, BLEU2VEC_Sep, Ngram2vec, MEANT 2.0, UHH_TSKM, AutoDA, TreeAggreg, BLEND, HyTERA, RUSE, ITER, YiSi, BERTr, EED, WMDO, PReP, cross-lingual similarity+target language model, XLM+TLM, Prism, COMET, PARBLEU, PARCHRF, MEE, BLEURT, BAQ-*, OPEN-KIWI-*, BERT, mBERT, EQ-*

Most of these metrics have been shown to be better than BLEU, but have rarely, if ever, been used. In fact, only 2 (1.8%) of these metrics, RIBES and chrF, have been used in more than two research publications (among the 700+ publications that I checked). Since 2010, the most used metrics are metrics proposed before 2010 (BLEU, TER, and METEOR):

Table by Marie et al., 2021

Most of the metrics created after 2016 are neural metrics. They rely on neural networks, and the most recent ones even rely on the very popular pre-trained language models.

In contrast, traditional metrics published earlier can be simpler and cheaper to run. They remain extremely popular for various reasons, and this popularity doesn't seem to decline, at least in research.

In the following sections, I introduce several metrics chosen for their popularity, their originality, or their correlation with human evaluation.

Traditional metrics for machine translation evaluation can be seen as metrics that evaluate the distance between two strings simply based on the characters they contain.

These two strings are the translation hypothesis and the reference translation. Note: Typically, traditional metrics don't exploit the source text translated by the system.

WER (Word Error Rate) was one of the most used of these metrics, and the ancestor of BLEU, before BLEU took over in the early 2000s.
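Setting aside implementation details, WER is simply the word-level Levenshtein (edit) distance between the hypothesis and the reference, normalized by the reference length. A minimal sketch, reusing the example sentences above:

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word Error Rate: word-level edit distance divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[len(ref)][len(hyp)] / len(ref)

hyp = "the cat sleeps in the kitchen so cook somewhere else"
ref = "the cat is sleeping in the kitchen so you should cook somewhere else"
print(wer(hyp, ref))  # 4 edits (1 substitution, 3 deletions) over 13 reference words
```

A lower WER means the hypothesis is closer to the reference; like most traditional metrics, it only rewards exact word matches.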

Advantages of traditional metrics

  • Low computational cost: Most traditional metrics rely on the efficiency of string matching algorithms run at character and/or token level. Some metrics do need to perform some shifting of tokens, which can be more costly, particularly for long translations. Nonetheless, their computation is easily parallelizable and doesn't require a GPU.
  • Explainable: Scores are usually easy to compute by hand for small segments, which facilitates analysis. Note: "Explainable" doesn't mean "interpretable"; i.e., we can exactly explain how a metric score is computed, but the score alone can't be interpreted since it often tells us nothing about translation quality.
  • Language independent: Except for some particular metrics, the same metric algorithms can be applied independently of the language of the translation.

Disadvantages of traditional metrics

  • Poor correlation with human judgments: This is their main drawback compared with neural metrics. To get the best estimation of the quality of a translation, traditional metrics should not be used.
  • Require particular preprocessing: Except for one metric (chrF), all the traditional metrics I present in this article require the evaluated segments, and their reference translations, to be tokenized. The tokenization isn't embedded in the metric, i.e., it must be performed by the user using external tools. The scores obtained then depend on a particular tokenization that may not be reproducible.
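To see concretely why tokenization matters, here is a minimal sketch (the sentences and the punctuation-splitting rule are illustrative assumptions): the same hypothesis/reference pair yields different n-gram overlaps, and hence different metric scores, depending on how punctuation is tokenized.

```python
def ngrams(tokens, n):
    """Set of n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

hyp = "The cat sleeps in the kitchen."
ref = "The cat is sleeping in the kitchen."

# Naive whitespace tokenization: the final period stays glued to "kitchen".
hyp_a, ref_a = hyp.split(), ref.split()
# Tokenization that splits punctuation off, as done by external tools.
hyp_b = hyp.replace(".", " .").split()
ref_b = ref.replace(".", " .").split()

overlap_a = len(ngrams(hyp_a, 2) & ngrams(ref_a, 2))
overlap_b = len(ngrams(hyp_b, 2) & ngrams(ref_b, 2))
print(overlap_a, overlap_b)  # different bigram overlaps, hence different scores
```

Any metric built on these counts, BLEU included, will therefore return different values for the two tokenizations, which is exactly the reproducibility problem described above.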

BLEU

This is the most popular metric. It is used by almost 99% of machine translation research publications.

I already presented BLEU in one of my previous articles.

BLEU is a metric with many well-identified flaws.

What I didn't discuss in my two articles about BLEU are its many variants.

When reading research papers, you may find metrics denoted BLEU-1, BLEU-2, BLEU-3, and so on. The number after the hyphen is usually the maximum length of the token n-grams used to compute the score.

For instance, BLEU-4 is a BLEU computed by taking {1,2,3,4}-grams of tokens into account. In other words, BLEU-4 is the typical BLEU computed in most machine translation papers, as originally proposed by Papineni et al. (2002).

BLEU is a metric that requires many statistics to be accurate. It doesn't work well on short texts, and may even yield an error if computed on a translation that doesn't match any 4-gram from the reference translation.

Since evaluating translation quality at sentence level may be necessary in some applications or for analysis, a variant denoted sentence BLEU, sBLEU, or sometimes BLEU+1, can be used. It avoids computational errors. There are many variants of BLEU+1. The most popular ones are described by Chen and Cherry (2014).
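One common BLEU+1-style fix is to add one to the n-gram match and total counts for n > 1, so a sentence with no matching 4-gram still gets a non-zero score. A hedged sketch of such a smoothed sentence-level BLEU (this is one of the smoothing variants described by Chen and Cherry, not the exact implementation of any particular toolkit):

```python
import math
from collections import Counter

def sentence_bleu(hyp: str, ref: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with add-one smoothing for n > 1 (a BLEU+1 variant)."""
    hyp_tok, ref_tok = hyp.split(), ref.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp_tok[i:i + n]) for i in range(len(hyp_tok) - n + 1))
        ref_ngrams = Counter(tuple(ref_tok[i:i + n]) for i in range(len(ref_tok) - n + 1))
        match = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if n > 1:  # add-one smoothing: avoids a zero precision (and log(0))
            match, total = match + 1, total + 1
        if match == 0:
            return 0.0
        log_prec_sum += math.log(match / total)
    # brevity penalty: penalizes hypotheses shorter than the reference
    bp = 1.0 if len(hyp_tok) >= len(ref_tok) else math.exp(1 - len(ref_tok) / len(hyp_tok))
    return bp * math.exp(log_prec_sum / max_n)
```

Without the smoothing, a hypothesis sharing no 4-gram with the reference would score exactly 0 (or trigger an error in some implementations), no matter how good its shorter n-gram matches are.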

As we will see with neural metrics, BLEU+1 has many better alternatives and should not be used.

chrF

chrF (Popović, 2015) is the second most popular metric for machine translation evaluation.

It has been around since 2015 and has since been increasingly used in machine translation publications.

It has been shown to correlate better with human judgment than BLEU.

In addition, chrF is tokenization independent. It is the only metric with this feature that I know of. Since it doesn't require any prior custom tokenization by an external tool, it is one of the best metrics to ensure the reproducibility of an evaluation.

chrF exclusively relies on characters. Spaces are ignored by default.
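The idea can be sketched in a few lines: compute precision and recall over character n-grams (n = 1..6 by default), with spaces removed, and combine them into an F-score where recall is weighted more (beta = 2). Note this is a simplified sketch that averages per-order F-scores; the official implementation aggregates precisions and recalls slightly differently.

```python
from collections import Counter

def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """chrF sketch: average character n-gram F-score, recall weighted by beta=2."""
    hyp_c = hyp.replace(" ", "")  # spaces are ignored by default
    ref_c = ref.replace(" ", "")
    f_scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp_c[i:i + n] for i in range(len(hyp_c) - n + 1))
        ref_ngrams = Counter(ref_c[i:i + n] for i in range(len(ref_c) - n + 1))
        match = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        hyp_total, ref_total = sum(hyp_ngrams.values()), sum(ref_ngrams.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # segment shorter than n characters
        prec, rec = match / hyp_total, match / ref_total
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores)
```

Because everything operates directly on the raw character string, no external tokenizer can change the score, which is the reproducibility advantage described above.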

chrF++ (Popović, 2017) is a variant of chrF that correlates better with human evaluation, but at the cost of tokenization independence. Indeed, chrF++ exploits spaces to take word order into account, hence its better correlation with human evaluation.

When I review machine translation papers for conferences and journals, I strongly recommend the use of chrF to make evaluations more reproducible, but not chrF++, due to its tokenization dependency.

Note: Be wary when you read a research work using chrF. Authors often confuse chrF and chrF++. They may also cite the chrF paper when using chrF++, and vice versa.

The original implementation of chrF by Maja Popović is available on GitHub.

You can also find an implementation in SacreBLEU (Apache 2.0 license).

RIBES

RIBES (Isozaki et al., 2010) is regularly used by the research community.

This metric was designed for “distant language pairs” with very different sentence structures.

For instance, translating English into Japanese requires significant word reordering, since the verb in Japanese is located at the end of the sentence, while in English it is usually placed before the complement.

The authors of RIBES found that the metrics available at the time, in 2010, weren't sufficiently penalizing incorrect word order, and thus proposed this new metric instead.

An implementation of RIBES is available on GitHub (GNU General Public License v2.0).

METEOR

METEOR (Banerjee and Lavie, 2005) was first proposed in 2005 with the objective of correcting several flaws of the traditional metrics available at the time.

For instance, BLEU only counts exact token matches. It is too strict, since words are not rewarded by BLEU if they are not exactly the same as in the reference translation, even if they have a similar meaning. As such, BLEU is blind to many valid translations.

METEOR partly corrects this flaw by introducing more flexibility in the matching. Synonyms, word stems, and even paraphrases are all accepted as valid matches, effectively improving the recall of the metric. The metric also implements a weighting mechanism to give more importance, for instance, to an exact match over a stem match.

The metric is computed as the harmonic mean of recall and precision, with the particularity that recall has a higher weight than precision.
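Restricted to exact unigram matches (the full metric also handles stems, synonyms, paraphrases, and a fragmentation penalty), that recall-weighted harmonic mean can be sketched as follows:

```python
from collections import Counter

def meteor_fmean(hyp: str, ref: str) -> float:
    """METEOR-style F-mean over exact unigram matches: recall weighted 9x
    over precision. (A sketch: the full metric also matches stems, synonyms,
    and paraphrases, and applies a fragmentation penalty.)"""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    matches = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    if matches == 0:
        return 0.0
    precision = matches / sum(hyp_counts.values())
    recall = matches / sum(ref_counts.values())
    # harmonic mean with recall weighted 9 times more than precision
    return 10 * precision * recall / (recall + 9 * precision)
```

With precision equal to recall the F-mean reduces to that common value, and as the weights show, a drop in recall hurts the score far more than the same drop in precision.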

METEOR correlates better with human evaluation than BLEU, and was improved multiple times until 2015. It is still regularly used nowadays.

METEOR has an official webpage maintained by CMU, which provides the original implementation of the metric (unknown license).

TER

TER (Snover et al., 2006) is mainly used to evaluate the effort it would take a human translator to post-edit a translation.


Post-editing in machine translation is the action of correcting a machine translation output into an acceptable translation. Machine translation followed by post-editing is a standard pipeline used in the translation industry to reduce translation costs.

There are two well-known variants: TERp (Snover et al., 2009) and HTER (Snover et al., 2009, Specia and Farzindar, 2010).

TERp is TER augmented with a paraphrase database to improve the recall of the metric and its correlation with human evaluation. A match between the hypothesis and the reference is counted if a token from the translation hypothesis, or one of its paraphrases, is in the reference translation.

HTER, standing for "Human TER", is a standard TER computed between a machine translation hypothesis and its post-edited version produced by a human. It can be used to evaluate the cost, a posteriori, of post-editing a particular translation.

CharacTER

The name of the metric already gives some hints on how it works: it is the TER metric applied at character level, while shift operations are performed at word level.

The edit distance obtained is also normalized by the length of the translation hypothesis.
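Leaving aside the word-level shift operations, the core of CharacTER reduces to a character-level edit distance normalized by the hypothesis length. A minimal sketch of that core:

```python
def character_score(hyp: str, ref: str) -> float:
    """CharacTER-style score without shift operations: character-level edit
    distance normalized by the hypothesis length (lower is better)."""
    # rolling-row Levenshtein distance at character level
    prev = list(range(len(ref) + 1))
    for i, hc in enumerate(hyp, 1):
        curr = [i]
        for j, rc in enumerate(ref, 1):
            curr.append(min(prev[j] + 1,              # delete from hypothesis
                            curr[j - 1] + 1,          # insert reference char
                            prev[j - 1] + (hc != rc)))  # substitute (or match)
        prev = curr
    return prev[len(ref)] / len(hyp)
```

Normalizing by the hypothesis length (instead of the reference length, as TER does) is the distinctive design choice of CharacTER; the real metric additionally applies word shifts before the character-level edit distance.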

CharacTER (Wang et al., 2016) has one of the best correlations with human evaluation among traditional metrics.

Nonetheless, it remains less used than other metrics. I couldn't find any papers that used it recently.

The implementation of CharacTER by its authors is available on GitHub (unknown license).

Neural metrics take a very different approach from traditional metrics.

They estimate a translation quality score using neural networks.

To the best of my knowledge, ReVal, proposed in 2015, was the first neural metric with the objective of computing a translation quality score.

Since ReVal, new neural metrics have been regularly proposed for evaluating machine translation.

The research effort in machine translation evaluation is now almost exclusively focused on neural metrics.

Yet, as we will see, despite their superiority, neural metrics are far from popular. While neural metrics have been around for nearly 8 years, traditional metrics are still overwhelmingly preferred, at least by the research community (the situation may be different in the machine translation industry).

Advantages of neural metrics

  • Good correlation with human evaluation: Neural metrics are state-of-the-art for machine translation evaluation.
  • No preprocessing required: This is mainly true for recent neural metrics such as COMET and BLEURT. Preprocessing, such as tokenization, is done internally and transparently by the metric, i.e., users don't have to care about it.
  • Better recall: Thanks to the exploitation of embeddings, neural metrics can reward translations even when they don't exactly match the reference. For instance, a word with a meaning similar to a word in the reference will likely be rewarded by the metric, in contrast to traditional metrics that can only reward exact matches.
  • Trainable: This can be an advantage as well as a disadvantage. Most neural metrics need to be trained. It is an advantage if you have training data for your specific use case: you can fine-tune the metric to best correlate with human judgments. However, if you don't have such training data, the correlation with human evaluation will be far from optimal.

Disadvantages of neural metrics

  • High computational cost: Neural metrics don't require a GPU, but are much faster if you have one. Yet, even with a GPU, they are significantly slower than traditional metrics. Some metrics relying on large language models, such as BLEURT and COMET, also require a significant amount of memory. Their high computational cost also makes statistical significance testing extremely costly.
  • Unexplainable: Understanding why a neural metric yields a particular score is nearly impossible, since the neural model behind it often leverages millions or billions of parameters. Improving the explainability of neural models is a very active research area.
  • Difficult to maintain: Older implementations of neural metrics no longer work if they weren't properly maintained. This is mainly due to changes in NVIDIA CUDA and/or frameworks such as (py)Torch and TensorFlow. Potentially, the current versions of the neural metrics we use today won't work in 10 years.
  • Not reproducible: Neural metrics usually come with many more hyperparameters than traditional metrics. Those are largely underspecified in the scientific publications using them. Therefore, reproducing a particular score for a particular dataset is often impossible.

ReVal

To the best of my knowledge, ReVal (Gupta et al., 2015) is the first neural metric proposed to evaluate machine translation quality.

ReVal was a substantial improvement over traditional metrics, with a significantly better correlation with human evaluation.

The metric is based on an LSTM and is very simple, but as far as I know it has never been used in machine translation research.

It’s now outperformed by newer metrics.

If you are interested in how it works, you can still find ReVal's original implementation on GitHub (GNU General Public License v2.0).

YiSi

YiSi (Chi-kiu Lo, 2019) is a very versatile metric. It mainly exploits an embedding model, but can be augmented with various resources such as a semantic parser, a large language model (BERT), and even features from the source text and source language.

Using all these options can make it fairly complex and reduces its scope to a few language pairs. Moreover, the gains in terms of correlation with human judgments when using all these options are not obvious.

Nonetheless, the metric itself, using just the original embedding model, shows a good correlation with human evaluation.

Figure by Chi-kiu Lo, 2019

The author showed that YiSi significantly outperforms traditional metrics when evaluating English translations.

The original implementation of YiSi is publicly available on GitHub (MIT license).

BERTScore

BERTScore (Zhang et al., 2020) exploits the contextual embeddings of BERT for each token in the evaluated sentence and compares them with the token embeddings of the reference.

It works as illustrated below:

Figure by Zhang et al., 2020

It is one of the first metrics to adopt a large language model for evaluation. It wasn't proposed specifically for machine translation, but rather for any language generation task.
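The core matching step can be sketched with toy embeddings: each hypothesis token is greedily matched to its most similar reference token by cosine similarity, and vice versa, giving a precision, a recall, and an F1. The 2-d vectors below are invented for illustration; the real metric uses BERT's contextual embeddings and optional IDF weighting.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bertscore_f1(hyp_emb, ref_emb):
    """BERTScore sketch: greedy max-cosine matching between token embeddings."""
    recall = sum(max(cosine(r, h) for h in hyp_emb) for r in ref_emb) / len(ref_emb)
    precision = sum(max(cosine(h, r) for r in ref_emb) for h in hyp_emb) / len(hyp_emb)
    return 2 * precision * recall / (precision + recall)

# Toy 2-d "embeddings": the vectors for "sleeps" and "sleeping" point in similar
# directions, so the near-synonym is rewarded instead of counted as a total miss.
ref_emb = [[1.0, 0.0], [0.0, 1.0]]  # e.g., "cat", "sleeping"
hyp_emb = [[1.0, 0.0], [0.1, 0.9]]  # e.g., "cat", "sleeps"
print(bertscore_f1(hyp_emb, ref_emb))  # close to 1.0 despite no exact word match
```

This is exactly the "better recall" advantage listed earlier: an exact-match metric would give the second token no credit at all, while the embedding-based match scores it near 1.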

BERTScore is the most used neural metric in machine translation evaluation.

A BERTScore implementation is available on GitHub (MIT license).

BLEURT

BLEURT (Sellam et al., 2020) is another metric relying on BERT, but one that is specifically trained for machine translation evaluation.

More precisely, it is a BERT model fine-tuned on synthetic data: sentences from Wikipedia paired with random perturbations of several kinds (Note: this step is confusingly denoted "pre-training" by the authors, see note 3 in the paper, but it actually comes after the original pre-training of BERT):

  • Masked words (as in the original BERT)
  • Dropped words
  • Backtranslation (i.e., sentences generated by a machine translation system)

Each sentence pair is evaluated during training with several losses. Some of these losses are computed with evaluation metrics:

Table by Sellam et al., 2020

Finally, in a second phase, BLEURT is fine-tuned on translations and their ratings provided by humans.

Intuitively, thanks to the use of synthetic data that may resemble machine translation errors or outputs, BLEURT is much more robust to quality and domain drifts than BERTScore.

Moreover, since BLEURT exploits a combination of metrics as "pre-training signals", it is intuitively better than each of these metrics individually, including BERTScore.

However, BLEURT is very costly to train. I'm only aware of the BLEURT checkpoints released by Google. Note: If you are aware of other models, please let me know in the comments.

The first version was only trained for English, but the newer version, denoted BLEURT-20, now includes 19 more languages. Both BLEURT versions are available in the same repository.

Prism

In their work proposing Prism, Thompson and Post (2020) intuitively argue that machine translation and paraphrasing evaluation are very similar tasks. The only difference is that the source language isn't the same.

Indeed, with paraphrasing, the objective is to generate a new sentence A', given a sentence A, with A and A' having the same meaning. Assessing how close A and A' are is identical to assessing how close a translation hypothesis is to a given reference translation. In other words: is the translation hypothesis a good paraphrase of the reference translation?

Prism is a neural metric trained on a large multilingual parallel dataset through a multilingual neural machine translation framework.

Then, at inference time, the trained model is used as a zero-shot paraphraser to score the similarity between a source text (the translation hypothesis) and a target text (the reference translation) that are both in the same language.

The main advantage of this approach is that Prism doesn't need any human evaluation training data nor any paraphrasing training data. The only requirement is parallel data for the languages you plan to evaluate.

While Prism is original, convenient to train, and seems to outperform most other metrics (including BLEURT), I couldn't find any machine translation research publication using it.

The original implementation of Prism is publicly available on GitHub (MIT license).

COMET

COMET (Rei et al., 2020) is a more supervised approach, also based on a large language model. The authors selected XLM-RoBERTa, but mention that other models such as BERT could also work with their approach.

In contrast to most other metrics, COMET exploits the source sentence. The large language model is thus fine-tuned on triplets {source sentence, translation hypothesis, reference translation}.

Figure by Rei et al., 2020

The metric is trained using human ratings (the same ones used by BLEURT).

COMET is much simpler to train than BLEURT, since it doesn't require the generation and scoring of synthetic data.

COMET is available in many versions, including distilled models (COMETINHO) that have a much smaller memory footprint.

The released implementation of COMET (Apache License 2.0) also includes a tool to efficiently perform statistical significance testing.

Machine translation evaluation is a very active research area. Neural metrics are getting better and more efficient every year.

Yet, traditional metrics such as BLEU remain the favorites of machine translation practitioners, mainly out of habit.

In 2022, the Conference on Machine Translation (WMT22) published a ranking of evaluation metrics based on their correlation with human evaluation, including metrics I presented in this article:

Table by Freitag et al. (2022)

COMET and BLEURT rank at the top, while BLEU appears at the bottom. Interestingly, you can also notice in this table that there are some metrics that I didn't write about in this article. Some of them, such as MetricX XXL, are undocumented.

Despite having countless better alternatives, BLEU remains by far the most used metric, at least in machine translation research.

Personal recommendations:

When I review scientific papers for conferences and journals, I always make the following recommendations to authors who only use BLEU for machine translation evaluation:

  • Add the results for at least one neural metric, such as COMET or BLEURT, if the language pair is covered by these metrics.
  • Add the results for chrF (not chrF++). While chrF is not state-of-the-art, it is significantly better than BLEU, yields scores that are easily reproducible, and can be used for diagnostic purposes.

