After some testing, I have the feeling that BLEU is not the best metric for NMT.
Admittedly, this could be just an impression (or wishful thinking), but when comparing some SMT and NMT results, we get comparable BLEU scores, yet the NMT sentences seem better constructed.
For clarification, I would like to discuss here how PPL and Prediction are computed in onmt.
My understanding is that at training time:
- we have a batch of n_s sentences containing n_w words in total (n_s is the same for the training batch size and the validation batch size).
- at each "report iteration" we compute the training perplexity over N batches, i.e. N x n_s sentences (roughly N x n_w words).
This per-word perplexity is computed from the current model weights: the log-likelihoods of the predicted words are summed, divided by the number of words, negated and exponentiated.
- at the end of each "epoch" we compute the validation perplexity over all the words of the validation data set.
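To make my understanding of the training-time computation concrete, here is a minimal sketch. It assumes natural logs and that the loss is accumulated as a summed negative log-likelihood per batch; the function name and the batch statistics are made up for illustration and are not taken from the onmt code.

```python
import math

def perplexity(total_nll, total_words):
    """Per-word perplexity from a summed negative log-likelihood (natural log)."""
    return math.exp(total_nll / total_words)

# Hypothetical statistics for N = 3 batches:
# (summed NLL over the batch, number of target words in the batch)
batches = [(120.0, 60), (95.0, 50), (130.0, 65)]

# Accumulate over the report interval, then normalize once by the word count.
total_nll = sum(nll for nll, _ in batches)
total_words = sum(n_words for _, n_words in batches)

report_ppl = perplexity(total_nll, total_words)
print(f"perplexity over {total_words} words: {report_ppl:.2f}")
```

Note that the sum is normalized once by the total word count, rather than averaging per-batch perplexities, so longer batches weigh more.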
At inference time:
- we compute a "PRED SCORE": the log-likelihood of the target words according to the model (on what scale?)
for instance, when I see PRED AVG SCORE: -0.67, PRED PPL: 1.96, I can't do the math.
EDITED: Now I can: PPL = exp(-avg score), and exp(0.67) ≈ 1.95, which matches the reported 1.96 up to rounding of the displayed score.
- optionally, if gold data is provided, we compute a Gold score and a Gold PPL ("according to the model"): how are they computed?
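The relation between the average score and the perplexity can be sketched as follows. This assumes the average score is a per-word log-likelihood in natural log. The helper names and the per-token values are hypothetical, and whether the gold score is obtained the same way (from the log-probabilities the model assigns to the reference tokens) is my assumption, not something I have confirmed in the code.

```python
import math

def avg_score(token_log_probs_per_sentence):
    """Average log-likelihood per word over a set of sentences."""
    total = sum(sum(sent) for sent in token_log_probs_per_sentence)
    n_words = sum(len(sent) for sent in token_log_probs_per_sentence)
    return total / n_words

def ppl_from_avg_score(score):
    """Perplexity is the exponential of the negative average log-likelihood."""
    return math.exp(-score)

# Made-up per-token log-probabilities for two sentences.
log_probs = [[-0.2, -1.1, -0.5], [-0.9, -0.3]]
score = avg_score(log_probs)
print(f"AVG SCORE: {score:.2f}, PPL: {ppl_from_avg_score(score):.2f}")

# The value from the log above: a displayed score of -0.67.
ppl_from_log = ppl_from_avg_score(-0.67)
```

With the rounded score of -0.67 this gives about 1.95 rather than 1.96, presumably because the perplexity in the log is computed from the unrounded score.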
To get back to the main question (BLEU, PPL, ...): I was wondering whether we could also have a target-side LM and use it to measure the perplexity (PPL_lm) of the model output. That would give a metric of how plausible the sentence is based on the target language alone.
My point is that a purely n-gram-based metric is far from accurate, and I have the feeling that it fails to capture the same gap that we see between an n-gram LM and an LSTM LM.
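As a rough illustration of the PPL_lm idea, here is a toy sketch. It uses an add-one smoothed bigram LM purely as a stand-in to keep the example self-contained; in practice the interesting case would be a neural LM trained on target-side data. The corpus, function names and smoothing choice are all assumptions made for this example.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Count bigrams over whitespace-tokenized target-side sentences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    # One extra vocabulary slot reserves probability mass for unseen words.
    return context_counts, bigram_counts, len(vocab) + 1

def lm_perplexity(sentence, lm):
    """Per-word perplexity of one sentence under the add-one smoothed bigram LM."""
    context_counts, bigram_counts, vocab_size = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        log_prob += math.log(p)
    n_predicted = len(tokens) - 1  # every token except <s> is predicted
    return math.exp(-log_prob / n_predicted)

# Tiny made-up target-side corpus.
lm = train_bigram_lm(["the cat sat on the mat", "the dog sat on the rug"])

# Two candidate "translations" with the same words in a different order.
fluent = lm_perplexity("the cat sat on the rug", lm)
scrambled = lm_perplexity("rug the on sat cat the", lm)
print(f"fluent: {fluent:.2f}, scrambled: {scrambled:.2f}")
```

The well-ordered output gets the lower PPL_lm here. Such a score only reflects target-side fluency and says nothing about adequacy with respect to the source, so it could complement BLEU rather than replace it.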