OpenNMT-tf toy-ende model scoring and inference clarification

I am getting acquainted with the toy-ende quickstart guide and have a couple of quick questions about model scoring and inference. I hope these questions might also clarify things for others.

  1. I have set up BLEU scoring by changing the yml file to include:

    eval_delay: 300
    save_eval_predictions: True
    external_evaluators: BLEU

It seems that BLEU scores are only calculated after a training run concludes. Is that the intended behaviour?

In a quick training run lasting 5000 steps (18.5 minutes on 4 GPUs), why were BLEU scores calculated only at 2 points (11 minutes in and at the end, at 18.5 minutes) rather than every 300 seconds as I thought I specified with eval_delay: 300?

Which checkpoint is chosen for each evaluation? Say I ask for BLEU scores every 5 minutes, but checkpoints are saved every ~20 minutes, what happens then?

  2. Is it possible to have BLEU scores calculated during training, for both eval and test sets, so I can monitor these in TensorBoard as training proceeds?

  3. The BLEU scores are extremely low (~0.2) even though the MT output looks legible (probably related to the point below). Any thoughts on why this might be the case?

  4. I have run inference with a trained model using the command

onmt-main infer --auto_config --config config.yml --features_file src-test.txt > predictions.txt

However, the predictions don’t seem at all related to the sentences in src-test.txt?

head -5 src-test.txt
Orlando Bloom and Miranda Kerr still love each other
Actors Orlando Bloom and Model Miranda Kerr want to go their separate ways .
However , in an interview , Bloom has said that he and Kerr still love each other .
Miranda Kerr and Orlando Bloom are parents to two-year-old Flynn .
Actor Orlando Bloom announced his separation from his wife , supermodel Miranda Kerr .

head -5 predictions.txt
32 ist für Windows Versionen für Wien .
Kroatien Sie bei Bratislava / Pressburg Flughafen / 56 km .
Es hat jedoch jedoch um eine Tag , um eine in der EU in diesem Union .
Sie vom Installation / 30 teil .
Zusammen für drei Hügel aus der eigenen Entschärfung , , private Seen sind eine Rechts .

head -5 tgt-test.txt
Orlando Bloom und Miranda Kerr lieben sich noch immer
Schauspieler Orlando Bloom und Model Miranda Kerr wollen künftig getrennte Wege gehen .
In einem Interview sagte Bloom jedoch , dass er und Kerr sich noch immer lieben .
Miranda Kerr und Orlando Bloom sind Eltern des zweijährigen Flynn .
Schauspieler Orlando Bloom hat sich zur Trennung von seiner Frau , Topmodel Miranda Kerr , geäußert .

Are these predictions jumbled up somehow? The sentences do not seem aligned. Is this why the BLEU scores are so low? How can I connect/unscramble the predictions to the true labels in tgt-test so I can evaluate MT quality?

Thanks a ton in advance!

The evaluation runs when eval_delay seconds have passed and a new checkpoint is available. It will not evaluate the same checkpoint twice.
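To make that rule concrete, here is a minimal Python sketch that simulates it. This is not OpenNMT-tf code: the function name `simulate_evaluations` is made up for illustration, the checkpoint times are hypothetical, and it assumes that the newest available checkpoint is the one that gets evaluated.

```python
def simulate_evaluations(checkpoint_times, eval_delay):
    """Return (eval_time, checkpoint_time) pairs, all in seconds.

    Assumed rule (paraphrasing the answer above): an evaluation starts once
    at least eval_delay seconds have passed since the previous one AND a
    checkpoint that has not been evaluated yet is available.
    Assumption of this sketch: the newest available checkpoint is evaluated.
    """
    pending = sorted(checkpoint_times)
    evaluations = []
    next_allowed = eval_delay  # earliest time the next evaluation may start
    i = 0
    while i < len(pending):
        # Earliest moment at which both conditions hold.
        t = max(next_allowed, pending[i])
        # Skip ahead to the newest checkpoint already saved at time t.
        while i + 1 < len(pending) and pending[i + 1] <= t:
            i += 1
        evaluations.append((t, pending[i]))
        next_allowed = t + eval_delay
        i += 1  # never evaluate the same checkpoint twice
    return evaluations


# Hypothetical run: checkpoints saved at 11 min and 18.5 min, eval_delay: 300.
print(simulate_evaluations([660, 1110], 300))
```

Under these assumptions, a run that only saves two checkpoints can only be evaluated twice, no matter how small eval_delay is. If checkpoints are saved every 20 minutes and eval_delay is 5 minutes, the sketch likewise yields one evaluation per checkpoint.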

With your configuration, BLEU on the eval set should already be computed during training. The test set is not available during training, though.

That’s expected and stated in the quickstart itself:

While this example gave you a quick overview of a typical OpenNMT-tf workflow, it will not produce state of the art results. The selected dataset and model are too small for this task.

See previous answer and the Going further section of the quickstart to train “real” models.
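For inspecting the output by hand, a small Python sketch like the following pairs each prediction with its source and reference sentence. It assumes that line i of predictions.txt corresponds to line i of src-test.txt and tgt-test.txt, which is an assumption of this sketch rather than something stated in this thread. The helper names `read_lines` and `pair_by_line` are made up for illustration.

```python
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without newlines."""
    return Path(path).read_text(encoding="utf-8").splitlines()


def pair_by_line(sources, predictions, references):
    """Pair sentences by line index; fail loudly if the counts differ.

    A mismatch in line counts would mean the files cannot be aligned
    one-to-one, so it is reported instead of silently truncated.
    """
    if not (len(sources) == len(predictions) == len(references)):
        raise ValueError(
            "line counts differ: %d sources, %d predictions, %d references"
            % (len(sources), len(predictions), len(references))
        )
    return list(zip(sources, predictions, references))


def format_pairs(pairs, limit=5):
    """Format the first `limit` triples for side-by-side reading."""
    blocks = []
    for src, pred, ref in pairs[:limit]:
        blocks.append("SRC:  %s\nPRED: %s\nREF:  %s" % (src, pred, ref))
    return "\n\n".join(blocks)
```

To use it, call `read_lines` on src-test.txt, predictions.txt and tgt-test.txt, pass the three lists to `pair_by_line`, and print the result of `format_pairs`. If the line counts match and the predictions still look unrelated, the cause is more likely the small model than a scrambled output order.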

This makes sense :slight_smile: Thanks Guillaume!

About “Going further”, the link looks broken now. It would be good to fix it.

Thanks, will be fixed.

It’s this page for reference: