Hi all,
I’ve implemented the noising scheme from an interesting paper by Edunov et al. (2018), “Understanding Back-Translation at Scale”
The idea is really simple: say you are training a model to do en-de translation. You have some parallel data and you are also back-translating a monolingual corpus to augment it
You can improve the training signal from your synthetic data quite a lot by adding noise to it
They propose to (see the sketch after this list):
(1) Delete random words (with probability 0.1)
(2) Replace random words with filler token (with probability 0.1)
(3) Swap words, with each word moving at most 3 positions
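For anyone curious, here is a minimal Python sketch of what the three operations could look like with the parameters above. This is not the code from the repo; the function names, signatures and the "<BLANK>" filler token are just my own illustrative choices.

```python
import random

def delete_words(words, p=0.1):
    # Drop each word independently with probability p
    kept = [w for w in words if random.random() > p]
    return kept if kept else words[:1]  # keep at least one word

def replace_words(words, p=0.1, filler="<BLANK>"):
    # Replace each word with a filler token with probability p
    # (the "<BLANK>" token is an assumption, not the repo's choice)
    return [filler if random.random() < p else w for w in words]

def swap_words(words, k=3):
    # Lightly shuffle word order: perturb each position with uniform
    # noise in [0, k] and re-sort, so each word moves at most k places
    keys = [i + random.uniform(0, k) for i in range(len(words))]
    return [w for _, w in sorted(zip(keys, words), key=lambda t: t[0])]

sentence = "the cat sat quietly on the old wooden chair".split()
noisy = swap_words(replace_words(delete_words(sentence)))
print(" ".join(noisy))
```

The swap is written here with the usual "add uniform noise to the positions and sort" trick, which keeps every word within k positions of where it started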
I’ve published some results in the README; the gains are quite impressive and the method is really simple
Available at: https://github.com/valentinmace/noisy-text
Feel free to play with noise parameters or implement your own functions
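As a purely hypothetical example of a custom noise function in the same style (not from the paper or the repo), you could for instance randomly duplicate words:

```python
import random

def duplicate_words(words, p=0.1):
    # Hypothetical extra noise: repeat each word with probability p
    out = []
    for w in words:
        out.append(w)
        if random.random() < p:
            out.append(w)
    return out
```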
It would also be interesting to know whether other NLP tasks can benefit from this kind of noise