OpenNMT Forum

Back translation

Hello,
I am retraining a relative Transformer ab-ru model with back translation, but the BLEU score is still lower than the baseline after 90k training steps.

The parallel corpus (100k sentences + 100k words) gave me a BLEU score of 20 on the test data for the ab-ru model.
I augmented it with back translations of 640k sentences, and the BLEU score is not climbing above 19 after 90k training steps.

From another post, I found out that setting beam_width: 1 should help.

Is there anything else I should be aware of that would improve performance?

Thank you,
Nart.

You may want to have a look at this paper: https://arxiv.org/abs/1808.09381
(Notions of random sampling for instance.)


Thank you for the link!
Going through the paper, section 5.3 (Low resource vs. high resource setup) suggests that in my case beam search (beam size 5) is more effective than sampling. Greedy search was not included in that setting for some reason.

Eventually sampling will be needed moving forward, so my question is: how do I achieve unrestricted sampling from the model distribution, and how do I use that in OpenNMT?

OpenNMT-tf: https://github.com/OpenNMT/OpenNMT-tf/blob/master/docs/inference.md#random-sampling

OpenNMT-py: https://opennmt.net/OpenNMT-py/options/translate.html#Random%20Sampling

Great!
These parameters are used during training, aren't they? So when I do inference later on, I would use a different beam width and remove sampling and noise, right?
To achieve results similar to the paper, should these parameters be configured like this:

  1. Random sampling:

params:
  beam_width: 1
  sampling_topk: 0
  sampling_temperature: 1

  2. Sampling (top 10):

params:
  beam_width: 1
  sampling_topk: 10
  sampling_temperature: 1

  3. Beam + noise:

params:
  beam_width: 5
  decoding_subword_token: ▁ # for sentencepiece
  decoding_noise:
    - dropout: 0.1
    - replacement: [0.1, ⦅unk⦆]
    - permutation: 3
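
For readers unfamiliar with the beam + noise setting, here is a rough Python sketch of what the three noise operations (dropout, replacement, permutation) could look like when applied to one beam-search output. It is only an illustration under assumptions: the function name `add_noise` and its arguments are hypothetical, it works on whole words rather than SentencePiece pieces, and it is not OpenNMT's actual implementation.

```python
import random


def add_noise(words, dropout=0.1, replacement=0.1, unk="⦅unk⦆",
              max_distance=3, rng=random):
    """Apply the three noise types to a list of words (hypothetical helper)."""
    # 1. Dropout: delete each word independently with probability `dropout`.
    kept = [w for w in words if rng.random() >= dropout]

    # 2. Replacement: swap each remaining word for the filler token
    #    with probability `replacement`.
    replaced = [unk if rng.random() < replacement else w for w in kept]

    # 3. Permutation: add uniform noise in [0, max_distance + 1) to each
    #    position and sort by the noisy position. With this trick a word
    #    ends up at most `max_distance` positions from where it started.
    keys = [i + rng.uniform(0, max_distance + 1) for i in range(len(replaced))]
    order = sorted(range(len(replaced)), key=keys.__getitem__)
    return [replaced[i] for i in order]


if __name__ == "__main__":
    sentence = "this is a back translated source sentence".split()
    print(" ".join(add_noise(sentence, rng=random.Random(0))))
```

With SentencePiece output, the pieces would first have to be grouped into words (which is presumably what decoding_subword_token is for) before noise like this is applied.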

Random sampling happens at inference, when producing your back translations.
Basically, it means that instead of greedy or beam search, tokens are sampled randomly from the output distribution at each decoding step.
It's a way of putting some (additional) noise in your dataset, if you prefer.
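
To make the difference concrete, here is a minimal Python sketch of a single decoding step. It assumes the model has already produced a list of unnormalized scores (logits) over the vocabulary. The function names and the `topk` / `temperature` arguments are illustrative and only mirror the meaning of sampling_topk and sampling_temperature discussed above; this is not OpenNMT's code.

```python
import math
import random


def greedy_next_token(logits):
    """Greedy search: always pick the highest-scoring token id."""
    return max(range(len(logits)), key=logits.__getitem__)


def sample_next_token(logits, topk=0, temperature=1.0, rng=random):
    """Sample a token id from the output distribution.

    topk=0 samples from the full distribution (unrestricted sampling);
    topk=k restricts sampling to the k highest-scoring tokens.
    """
    # Temperature 1 leaves the distribution unchanged; lower values sharpen it.
    scaled = [x / temperature for x in logits]

    # Optionally keep only the top-k candidate token ids.
    ids = list(range(len(scaled)))
    if topk > 0:
        ids = sorted(ids, key=scaled.__getitem__, reverse=True)[:topk]

    # Softmax weights over the candidates (max subtracted for stability).
    top = max(scaled[i] for i in ids)
    weights = [math.exp(scaled[i] - top) for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
    rng = random.Random(0)
    print("greedy:", greedy_next_token(logits))
    print("sampled:", [sample_next_token(logits, rng=rng) for _ in range(10)])
    print("top-2:", [sample_next_token(logits, topk=2, rng=rng) for _ in range(10)])
```

Greedy search returns the same token every time, while sampling occasionally picks lower-probability tokens, which is where the extra variety in the back-translated data comes from.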