Metrics (Bleu, ppl, gold ppl, pred ....)

After some testing, I have the feeling that BLEU is not the best metric for NMT.
This may be just an impression (or a wish :)), but when comparing some SMT and NMT results, we get comparable BLEU scores, yet the NMT sentences seem better constructed.

For clarification, I would like to discuss here how PPL and the prediction scores are computed in OpenNMT.

My understanding is that at training time:

  • we have a batch of n_s sentences made of n_w words in total (n_s is the same for the training batch size and the validation batch size).
  • at each “report iteration” we compute the training perplexity over N batches, i.e. N x n_s sentences.
    This per-word perplexity is computed according to the model weights, from the summed log likelihood of each predicted word.
  • at the end of each “epoch” we compute the validation perplexity over all the words of the validation data set.

At inference time:

  • we compute a “PRED SCORE”: the log likelihood of the target words according to the model (on what scale?).
    For instance, when I see PRED AVG SCORE: -0.67, PRED PPL: 1.96, I can’t do the math.
    EDITED: Now I can :) exp(0.67) ≈ 1.95, which matches the reported 1.96 up to rounding of the displayed average score.
  • optionally, if gold data is provided, we compute Gold score and Gold PPL (“according to the model”): how are they computed?

To get back to the main question (BLEU, PPL, …), I was wondering whether we could also train a target-side LM and use it to measure the perplexity (PPL_lm) of the model output. That would give a metric of how plausible the sentence is, based on the language only.

My point is that a purely n-gram-based metric is far from accurate, and I have the feeling that it does not capture the same gap that exists between an n-gram LM and an LSTM LM.



The reported perplexity always depends on the true target data when available (training, validation and gold data). Otherwise it is the perplexity of the model’s own prediction (PRED). It is computed this way:

exp(loss / num_target_words)

with the loss being the cumulated negative log likelihood of the true target data. So it tells us how confidently the model would generate these targets.
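
As a minimal sketch of the formula above (plain Python, not OpenNMT's actual code; the function name `perplexity` and the toy probabilities are assumptions made for illustration):

```python
import math

def perplexity(token_log_probs):
    """exp(loss / num_target_words), where loss is the summed negative log likelihood."""
    if not token_log_probs:
        raise ValueError("need at least one target word")
    loss = -sum(token_log_probs)  # cumulated negative log likelihood
    return math.exp(loss / len(token_log_probs))

# Hypothetical natural-log probabilities the model assigns to each true target word.
gold_log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(gold_log_probs))
```

The higher the probability the model gives to the true target words, the closer this value gets to 1.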

At inference time:

  • PRED SCORE is the cumulated log likelihood of the generated sequence
  • PRED AVG SCORE is the log likelihood per generated word
  • PRED PPL is the perplexity of the model’s own predictions (exp(-PRED AVG SCORE))
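
The three quantities above can be sketched as follows. This is an illustration only, not OpenNMT's code: the helper name `pred_scores` and the per-word log likelihoods are made up.

```python
import math

def pred_scores(token_log_probs):
    """Return (PRED SCORE, PRED AVG SCORE, PRED PPL) for one generated sequence."""
    pred_score = sum(token_log_probs)                   # cumulated log likelihood
    pred_avg_score = pred_score / len(token_log_probs)  # log likelihood per generated word
    pred_ppl = math.exp(-pred_avg_score)                # perplexity of the model's own output
    return pred_score, pred_avg_score, pred_ppl

# Hypothetical log likelihoods of four generated words.
hyp_log_probs = [-0.2, -1.1, -0.5, -0.892]
score, avg, ppl = pred_scores(hyp_log_probs)
print(f"PRED AVG SCORE: {avg:.2f}, PRED PPL: {ppl:.2f}")
```

With these toy values the average is -0.673, which is displayed as -0.67, while the perplexity is displayed as 1.96. Exponentiating the rounded 0.67 by hand gives about 1.95, so a small mismatch of this kind comes from rounding only.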

Hi Vincent @vince62s.

In our experiments we have noticed the same: BLEU, TER, F-Measure, etc. (i.e., n-gram based scores) do not reflect the actual quality of NMT (especially compared to PBSMT).

You can check our paper:


When I am ready to report a score, which one should I use:
the number that appears first, or the one after the word BLEU?

39.20 +/-1.43 BLEU = 39.32, 72.9/49.3/35.1/25.2 (BP=0.932, ratio=0.934, hyp_len=20965, ref_len=22452)

I notice that when I run the same command again (no changes made), the first number and the Confidence interval vary with every execution, but the rest remain the same.

Can someone help clarify?

The actual score is 39.32. The 39.20 +/- 1.43 gives you an idea of the error margin (a 95% interval, computed on k-fold random subsets, which is why it changes from run to run). In other words, if you compare with another run that scores 39.7, you cannot really conclude that the second one is significantly better.
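
To see how the rest of the line fits together, here is a small sketch that recomputes the score from the numbers printed after "BLEU =". It assumes the standard BLEU definition (brevity penalty times the geometric mean of the 1- to 4-gram precisions); it is not the scorer's own code.

```python
import math

# Values copied from the reported line.
precisions = [72.9, 49.3, 35.1, 25.2]  # 1- to 4-gram precisions, in percent
hyp_len, ref_len = 20965, 22452

ratio = hyp_len / ref_len
# Standard brevity penalty: 1 when the hypothesis is at least as long as the reference.
bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
# Geometric mean of the n-gram precisions.
geo_mean = math.exp(sum(math.log(p / 100) for p in precisions) / len(precisions))
bleu = 100 * bp * geo_mean

print(f"ratio={ratio:.3f} BP={bp:.3f} BLEU={bleu:.2f}")
```

The result lands within a few hundredths of the reported 39.32, which is consistent with the precisions in the log line being rounded to one decimal. None of these quantities involve random sampling, so they stay the same across runs.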

Thanks @jean.senellart for the above help.

Regarding Perplexity (PPL):
I was reading deeper into perplexity today as it relates to language models (in the statistical machine translation sense). I also tried to collect everything I could by crawling through the OpenNMT forum and Lua scripts to connect it to NMT.

According to this Stack Exchange question, a model whose perplexity equals the size of the vocabulary it works with is a “dumb model”, one that assigns a uniform distribution over all words. That was calculated assuming unigrams. A similar claim is made in this MIT NLP lecture notes file.

I take away two things from this: the floor for a PPL score is 1, and the ceiling is the size of my target vocabulary. What I gather from my OpenNMT search is that training PPL is computed over the total number of words in the corpus (?) and not over the vocabulary.

So this resulted in the following questions:

  1. Can the same be said about floor and ceiling values for PPL in the context of NMT?
  2. Does OpenNMT calculate the PPL with a unigram model? (I assume so, on the grounds that RNN models produce a sequence element by element.)
  3. If the ceiling for PPL should be the size of my vocabulary (50,004), how is this possible:
    Epoch 1 ; Iteration 50/64276 ; Optim SGD LR 1.0000 ; Source tokens/s 1153 ; Perplexity 6744931.15
    FYI: my source vocab was also 50004

Links to other sources would also be appreciated.

I include some of the ones that I have already looked at and which have been largely helpful until now.

Take for example the formula listed on the Wikipedia page for Perplexity:

$$b^{-\frac{1}{N}\sum_{i=1}^{N}\log_b q(x_i)}$$

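In OpenNMT, we use b = e (the exponential, but it could also be b = 2). Also, N here is the number of samples, which is the total number of target words, since the model is scored one word at a time as you pointed out (each word's probability is still conditioned on the source and the previous words, so this is not a unigram model in the LM sense).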

  • The minimum value is when q(x_i) = 1 for all i (i.e. predict each sample with the highest confidence), which makes the perplexity equal to exp(0) = 1.
  • The maximum value is when q(x_i) = 0 for all i (i.e. predict each sample with the lowest confidence), which makes the perplexity equal to exp(+inf) = +inf.

To add to Guillaume's answer: if the generation probability were uniform over the vocabulary (it never is, even at the beginning of training, because of the random initialisation), then the PPL would be exp(-1/N * N * log(1/V)) = exp(log V), which is exactly V, the vocabulary size. That explains the connection you found with the vocabulary size.

So the vocabulary size is a useful theoretical PPL to compare against, but in practice, after the first mini-batches, you should already have a PPL << V.
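
A quick numerical check of the three regimes discussed above (a sketch with made-up probabilities; `perplexity` is a hypothetical helper and N is an arbitrary word count):

```python
import math

def perplexity(probs):
    """exp of the average negative log probability given to the true words."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

V = 50004  # vocabulary size quoted in the question
N = 1000   # hypothetical number of target words

uniform = perplexity([1.0 / V] * N)       # every true word gets probability 1/V
confident_wrong = perplexity([1e-7] * N)  # true words get far less than 1/V
perfect = perplexity([1.0] * N)           # every true word gets probability 1

print(uniform, confident_wrong, perfect)
```

The uniform case gives V (up to floating-point error) and the perfect case gives 1. Whenever the model puts less than 1/V on the true words, the perplexity exceeds V, which is how an early-training value such as 6744931.15 can appear with a 50,004-word vocabulary.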