Back translation

Nart · September 15, 2020, 8:42pm

Hello,
I am retraining a relative Transformer ab-ru model with back-translation, but the BLEU score is still lower than before after 90k training steps.

The parallel corpus alone (100k sentences + 100k words) gave me a BLEU score of 20 on the test data for the ab-ru model.
After augmenting it with 640k back-translated sentences, the BLEU score does not climb above 19 after 90k training steps.

From another post, I found out that setting beam_width: 1 should help.

Is there anything else I should be aware of that would improve performance?

Thank you,
Nart.

francoishernandez · September 16, 2020, 8:04am

You may want to have a look at this paper: https://arxiv.org/abs/1808.09381
(Notions of random sampling for instance.)

Nart · September 17, 2020, 12:32pm

Thank you for the link!
Going through section 5.3 of the paper (Low resource vs. high resource setup), it seems that in my case beam search (beam size 5) is more effective than sampling. Greedy search was not included in that setting for some reason.

Sampling will eventually be needed moving forward, so my question is: how do I achieve unrestricted sampling from the model distribution, and how do I use that in OpenNMT?

francoishernandez · September 17, 2020, 12:35pm

OpenNMT-tf: https://github.com/OpenNMT/OpenNMT-tf/blob/master/docs/inference.md#random-sampling

OpenNMT-py: https://opennmt.net/OpenNMT-py/options/translate.html#Random%20Sampling

Nart · September 17, 2020, 12:48pm

Great!
These parameters are used during training, aren't they? So when I run inference later on, I would use a different beam width and remove the sampling and noise, right?
To achieve results similar to the paper's, should these parameters be configured like this?

Random sampling:

params:
  beam_width: 1
  sampling_topk: 0
  sampling_temperature: 1

Sampling (top 10):

params:
  beam_width: 1
  sampling_topk: 10
  sampling_temperature: 1

Beam + noise:

params:
  beam_width: 5
  decoding_subword_token: ▁ # for sentencepiece
  decoding_noise:
    - dropout: 0.1
    - replacement: [0.1, ｟unk｠]
    - permutation: 3
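
For reference, here is a rough Python sketch of what the three noise types in the last config do to a sentence, assuming the definitions from the paper linked above (drop words, replace words with a filler token, and shuffle words locally). The function name add_noise and its arguments are hypothetical, and this is not the OpenNMT-tf implementation, only an illustration working on a list of word tokens.

```python
import random


def add_noise(tokens, dropout=0.1, replacement=0.1, filler="｟unk｠",
              max_distance=3, rng=random):
    """Illustrative word-level noise: dropout, replacement, local permutation."""
    # 1. Dropout: remove each word independently with probability `dropout`.
    kept = [t for t in tokens if rng.random() >= dropout]

    # 2. Replacement: swap each remaining word for the filler token
    #    with probability `replacement`.
    replaced = [filler if rng.random() < replacement else t for t in kept]

    # 3. Local permutation: add uniform noise in [0, max_distance + 1) to each
    #    position and sort by the noisy position. A word can then end up at
    #    most `max_distance` positions away from where it started.
    keys = [i + rng.uniform(0, max_distance + 1) for i in range(len(replaced))]
    order = sorted(range(len(replaced)), key=lambda i: keys[i])
    return [replaced[i] for i in order]


if __name__ == "__main__":
    sentence = "this is a plain example sentence with a few words".split()
    print(add_noise(sentence, rng=random.Random(1)))
```

The noise is applied to whole words rather than subword pieces, which is presumably why the config also needs decoding_subword_token to know where word boundaries are.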

francoishernandez · September 17, 2020, 3:35pm

Random sampling is happening at inference when producing your backtranslations.
Basically it means that instead of greedy or beam search, tokens will be sampled randomly from the output distribution at each decoding step.
It’s a way of putting some (additional) noise in your dataset, if you prefer.
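
To make the difference concrete, here is a minimal Python sketch of a single decoding step. It is not OpenNMT code; the function names and the toy logits are made up for illustration. The topk and temperature arguments are meant to mirror the idea behind sampling_topk and sampling_temperature, with topk=0 standing for sampling from the full distribution.

```python
import math
import random


def greedy_next_token(logits):
    """Greedy search: always pick the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_next_token(logits, topk=0, temperature=1.0, rng=random):
    """Sample a token id from softmax(logits / temperature).

    topk=0 samples from the whole vocabulary (unrestricted sampling);
    topk=N restricts sampling to the N highest-scoring tokens.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    candidates = list(range(len(scaled)))
    if topk > 0:
        candidates = sorted(candidates, key=lambda i: scaled[i], reverse=True)[:topk]
    # Softmax weights over the candidates (max subtracted for numerical stability).
    top = max(scaled[i] for i in candidates)
    weights = [math.exp(scaled[i] - top) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1, 2.0, -1.0, 1.5]  # hypothetical scores for 4 tokens
    print("greedy:", greedy_next_token(toy_logits))
    print("sampled:", [sample_next_token(toy_logits) for _ in range(10)])
    print("top-2 sampled:", [sample_next_token(toy_logits, topk=2) for _ in range(10)])
```

Greedy decoding returns the same token every time, while sampling occasionally picks lower-probability tokens, which is where the extra diversity in the back-translated data comes from.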

ymoslem · February 28, 2021, 4:21pm

When it comes to adding noise to back-translation, I would like to remind you, colleagues, of this paper, which offers a simple approach: Tagged Back-Translation.
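
As a rough sketch of the idea (prepend a reserved tag to the source side of every back-translated pair, and leave the genuine parallel data untouched), something like the following could be used when building the training corpus. The function name, the "<BT>" tag string, and the placeholder sentences are assumptions for illustration, not taken from OpenNMT.

```python
def build_training_corpus(genuine_pairs, back_translated_pairs, tag="<BT>"):
    """Combine genuine and back-translated (source, target) pairs.

    Only the synthetic source sentences get the tag, so the model can tell
    back-translated input apart from genuine input during training.
    """
    tagged = [(f"{tag} {src}", tgt) for src, tgt in back_translated_pairs]
    return list(genuine_pairs) + tagged


if __name__ == "__main__":
    genuine = [("genuine source sentence", "genuine target sentence")]
    synthetic = [("back-translated source sentence", "monolingual target sentence")]
    for src, tgt in build_training_corpus(genuine, synthetic):
        print(src, "|||", tgt)
```

The tag should survive tokenization as a single token and be present in the source vocabulary. At inference time the input is left untagged.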

@Nart @mayub @prashanth