Add noise to corpus and improve training signal

Hi all,

I’ve implemented an interesting paper by Edunov et al. (2018), “Understanding Back-Translation at Scale”.

The idea is really simple: say you are training a model to do en-de translation. You have some parallel corpora, and you are also back-translating a monolingual corpus to augment your data.

You can improve the training signal from your synthetic data quite a lot by adding noise to it.

They propose to:
(1) Delete random words (each with probability 0.1)
(2) Replace random words with a filler token (each with probability 0.1)
(3) Swap words (with a maximum range of 3 positions)
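The three steps above can be sketched in a few lines of Python. This is my own minimal sketch, not the repo's actual code: the function names and the `<BLANK>` filler string are assumptions, and the length-limited swap is implemented the usual way, by sorting positions perturbed with uniform noise in [0, max_range].

```python
import random

def delete_words(tokens, p=0.1):
    # Step 1: drop each token independently with probability p.
    kept = [t for t in tokens if random.random() >= p]
    return kept if kept else tokens  # never return an empty sentence

def replace_with_filler(tokens, p=0.1, filler="<BLANK>"):
    # Step 2: replace each token with a filler token with probability p.
    # The filler string "<BLANK>" is an arbitrary choice here.
    return [filler if random.random() < p else t for t in tokens]

def swap_words(tokens, max_range=3):
    # Step 3: a local permutation in which no token moves more than
    # max_range positions, obtained by sorting on (index + noise).
    keys = [i + random.uniform(0, max_range) for i in range(len(tokens))]
    return [t for _, t in sorted(zip(keys, tokens))]

def add_noise(sentence, p=0.1, max_range=3):
    # Apply all three noise functions to a whitespace-tokenized sentence.
    tokens = sentence.split()
    tokens = delete_words(tokens, p)
    tokens = replace_with_filler(tokens, p)
    tokens = swap_words(tokens, max_range)
    return " ".join(tokens)
```

With these parameters, `add_noise("the quick brown fox jumps over the lazy dog")` yields a lightly corrupted variant each call, which you would apply to the source side of the back-translated pairs only.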

I’ve published some results in the README; they’re quite impressive, and the method is really simple.
Available at:

Feel free to play with noise parameters or implement your own functions :slight_smile:

Also, it would be interesting to know whether other NLP tasks can benefit from it.



Thanks for sharing!

For reference, this feature can be enabled when decoding with OpenNMT-tf:


Hi Guillaume,

I did not know about the OpenNMT implementation, it’s very cool.

However, is it possible to distinguish between your synthetic data and your parallel data (which you may want to leave without noise) at training time?

Do you mean like tagging the noisy data?

My question was irrelevant, since in your implementation the noise is applied at back-translation (decoding) time, if I understand correctly.

In my case it is a preprocessing step applied after back-translation.

So thanks for the implementation, and for the suggested paper, which seems very interesting.