After some testing, I have the feeling that BLEU is not the best metric for NMT.

Indeed, that could be just an impression (or a wish), but when comparing some SMT and NMT results, we get comparable BLEU scores, yet the NMT phrases seem better constructed.

For clarification, I would like to discuss here how PPL and Prediction are computed in onmt.

My understanding is that at training time:

- we have a batch of n_s sentences containing n_w words in total (n_s is the same for the training and validation batch sizes).
- at each "report iteration" we compute the training perplexity over N batches, i.e. N x n_w words.

This per-word perplexity is computed under the current model weights as the exponential of the negative log likelihood sum over the predicted words, divided by the number of words.

- at the end of each "epoch" we compute the validation perplexity over all the words of the validation set.
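If my understanding is right, the computation boils down to this (a minimal sketch; the per-token log likelihoods here are made-up numbers, not onmt output):

```python
import math

def perplexity(token_log_likelihoods):
    """Per-word perplexity: exp of the negative mean token log likelihood."""
    n = len(token_log_likelihoods)
    return math.exp(-sum(token_log_likelihoods) / n)

# Hypothetical natural-log likelihoods for a 5-word sentence.
token_logps = [-0.2, -1.1, -0.05, -0.9, -0.4]
print(perplexity(token_logps))
```

Summing over N batches before dividing gives the training perplexity reported at each report iteration.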

At inference time:

- we compute a "PRED SCORE": the log likelihood of the target words according to the model (on what scale?)

For instance, when I see PRED AVG SCORE: -0.67, PRED PPL: 1.96, I can't do the math.

EDITED: now I can: exp(0.67) = 1.96.

- optionally, if gold data is provided, we compute a GOLD SCORE and GOLD PPL ("according to the model"): how are they computed?
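So, assuming PRED AVG SCORE is the mean natural-log likelihood per output token, the two reported numbers are consistent:

```python
import math

pred_avg_score = -0.67  # mean log likelihood per predicted token, as reported
pred_ppl = math.exp(-pred_avg_score)  # perplexity = exp(-avg log likelihood)
print(round(pred_ppl, 2))  # 1.95
```

(The reported 1.96 presumably comes from the unrounded average score.)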

To get back to the main question (BLEU, PPL, ...), I was wondering whether we could simultaneously train a target-side LM and use it to measure the PPL_lm of the model output, which would give a metric of how fluent the sentence is based on the language alone.
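A minimal sketch of the idea, using an add-one-smoothed bigram LM as a stand-in for a real LSTM LM (all names and the toy corpus are mine, not onmt code):

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Train an add-one-smoothed bigram LM on tokenized target-side sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab_size = len({t for s in corpus for t in s} | {"</s>"})

    def logprob(prev, word):
        # Add-one smoothing so unseen bigrams get non-zero probability.
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

    return logprob

def lm_ppl(logprob, sent):
    """PPL_lm of one NMT output sentence under the target-side LM."""
    toks = ["<s>"] + sent + ["</s>"]
    lps = [logprob(p, w) for p, w in zip(toks[:-1], toks[1:])]
    return math.exp(-sum(lps) / len(lps))

lm = train_bigram_lm([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(lm_ppl(lm, ["the", "cat", "sat"]))
```

A well-formed output should score a lower PPL_lm than a scrambled one, which is exactly the fluency signal BLEU's n-gram overlap misses.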

My point is that a metric based only on n-grams is far from accurate, and I have the feeling it does not bridge the same gap that exists between an n-gram LM and an LSTM LM.

Vincent