Low-resource languages, infinite training, and back-translation

Hi all,
I’m back here with a question, after being away for a long time. Of course, I’m certainly missing some fresh information about the details of recent developments in ONMT.
Do you have a link explaining how your Infinite Training works?
Would it be possible to link two models with infinite training, continuously producing back-translations for each other, using large monolingual data sets?
It could be a very nice way to solve the problem of learning languages with very little available parallel data.
No?

PS: if the monolingual data sets are known at the starting point, it’s also possible to provide the full large vocab dicts right at the beginning, even if the very first parallel data set only covers a small part of them.
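The idea of extracting the full vocab from the monolingual corpora up front, and then measuring how little of it the initial parallel set covers, could be sketched like this (a toy illustration with `collections.Counter`; the corpora and token counts are made up, and real preprocessing would use ONMT's own tooling):

```python
from collections import Counter

# Toy sketch: build the large vocab from the monolingual data at the
# start, even though the first parallel set only covers part of it.
# (Illustrative only; not the ONMT preprocessing pipeline.)

monolingual = [
    "kochen mit salz und pfeffer",
    "pfeffer und salz kochen",
]
parallel_src = ["kochen mit salz"]  # small initial parallel set

counts = Counter(tok for line in monolingual for tok in line.split())
vocab = [tok for tok, _ in counts.most_common()]  # full vocab up front

parallel_tokens = {tok for line in parallel_src for tok in line.split()}
coverage = len(parallel_tokens & set(vocab)) / len(vocab)
print(f"{len(vocab)} vocab entries, "
      f"{coverage:.0%} covered by the initial parallel data")
```

The point is just that the vocab is fixed from day one, so later back-translated data never hits out-of-vocabulary entries that weren't planned for.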



I don’t think there are any public resources on this, sorry.

This approach looks similar to work being done on unsupervised NMT. See for example:



This paper was presented at the ONMT Workshop, wasn’t it?

Yes, that’s the idea. But I would like to apply this to low-resource language pairs, continuously enriching the data while the models are trained, so that they are supposed to get better and better at each step. It would be a kind of “bootstrapped infinite training with infinite data sets”.

Suppose you want to train a dedicated DE-PT cooking model: you don’t have any well-translated DE-PT cooking data, but you do have large monolingual cooking data.
Start from two generic DE-PT models, and link them so that they continuously train each other to repair the bad translations produced by the other, getting better and better at each step.
Would this give us high-quality cooking translation models?
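The linked-models loop described above could be sketched roughly like this (a toy Python illustration with made-up `ToyModel`, `translate`, and `train` interfaces — not the OpenNMT API, and a dictionary lookup standing in for a real NMT model):

```python
# Toy sketch of the proposed dual back-translation loop (hypothetical
# interfaces, not OpenNMT's). Two models translate between DE and PT;
# in each round, each model trains on synthetic pairs produced by the
# other from monolingual data.

class ToyModel:
    """Stand-in for an NMT model: a word-for-word dictionary lookup."""
    def __init__(self, lexicon):
        self.lexicon = dict(lexicon)

    def translate(self, sentence):
        # Unknown words are copied through unchanged.
        return " ".join(self.lexicon.get(w, w) for w in sentence.split())

    def train(self, pairs):
        # "Training" here just memorizes word alignments from the
        # synthetic parallel pairs (a crude placeholder for SGD updates).
        for src, tgt in pairs:
            for s, t in zip(src.split(), tgt.split()):
                self.lexicon[s] = t

def back_translation_round(model_fwd, model_bwd, mono_tgt):
    """Back-translate target-side monolingual text with model_bwd, then
    train model_fwd on the (synthetic source, real target) pairs."""
    synthetic = [(model_bwd.translate(t), t) for t in mono_tgt]
    model_fwd.train(synthetic)

# de_pt translates DE->PT, pt_de translates PT->DE (tiny seed lexicons).
de_pt = ToyModel({"kochen": "cozinhar"})
pt_de = ToyModel({"cozinhar": "kochen", "sal": "salz"})

mono_pt = ["cozinhar sal"]   # monolingual PT cooking data
mono_de = ["kochen salz"]    # monolingual DE cooking data

for _ in range(2):           # "infinite" training, truncated for the demo
    back_translation_round(de_pt, pt_de, mono_pt)  # improve DE->PT
    back_translation_round(pt_de, de_pt, mono_de)  # improve PT->DE

# de_pt never saw "salz" in its seed data, but learned it from the
# synthetic pairs produced by pt_de.
print(de_pt.translate("kochen salz"))  # -> "cozinhar sal"
```

Of course a real setup would use actual NMT training steps, but the data flow is the same: each model's monolingual target data becomes the other model's synthetic training source.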
This video is a nice way of thinking about what happens in back-translation training: a way to learn how to repair synthetically generated damaged data:
