Model parameters and examples for back translation

Hi @guillaumekln

I see that OpenNMT-tf supports back translation, and many users are interested in this.
I’m slightly confused about which commands and parameters I need to use in the latest version to make it work. I did not see any clear example of how to go about doing this, hence the confusion.

Do I pass the monolingual data file separately? If not, what pre-processing do I need to do to append it to the training data?

Are there any additional model parameters I need to set? I do see some config parameters like “freeze_layers”, “sampling_topk”, “decoding_noise”, etc. available.

Thanks !


All NMT systems support back translation since it is just a combination of training and translating: you train a model in the reverse direction and translate a monolingual corpus to generate a new synthetic corpus. Are you looking for a script that does this workflow for you? There is no such thing in OpenNMT-tf.
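As a sketch, the workflow looks like the following (the config and file names here are just placeholders, not anything shipped with OpenNMT-tf):

```shell
# 1. Train a model in the reverse direction (target -> source) on the
#    existing parallel data:
onmt-main --model_type Transformer --config reverse.yml --auto_config train

# 2. Translate the target-side monolingual corpus with that model to
#    produce a synthetic source side:
onmt-main --config reverse.yml --auto_config infer \
    --features_file mono.txt --predictions_file synthetic.txt

# 3. Append the (synthetic.txt, mono.txt) pairs to the original training
#    data and retrain the forward model.
```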

The parameters “sampling_topk” and “decoding_noise” are translation parameters that were proposed in the following paper:

They can be used to improve back translation results but are not required.
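For reference, these would go under the `params` section of the YAML configuration; something like the sketch below (the values are illustrative, not recommendations):

```yaml
params:
  beam_width: 1
  # sample from the k most likely tokens instead of taking the argmax
  # (0 means sample from the full output distribution)
  sampling_topk: 0
  # add noise to the generated translations
  decoding_noise:
    - dropout: 0.1
    - replacement: [0.1, ⦅unk⦆]
    - permutation: 3
```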

No problem, I can run those steps; how do I get started? Below is my thinking.
Let’s say I’m building for English–Spanish and I have a large Spanish monolingual corpus. Do I just run inference on this monolingual corpus using a pre-trained Spanish–English model to get synthetic sentences? At what point in the process will I need to set the mentioned back-translation parameters?

Thanks !


When running the inference on the monolingual corpus.

Great. This will take a long time to run if we have millions of sentences. Apart from running on GPU and batching, any thoughts on speeding up this process?
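For scale, I was thinking of something like sharding the corpus and running one inference process per shard/GPU. A rough sketch (the shard sizes and file names are made up, and the commented onmt-main command is illustrative):

```shell
# Shard the monolingual corpus so several inference processes can run
# in parallel, one shard per GPU.
# (A tiny generated file stands in for the real corpus here.)
seq 1 100 | sed 's/^/spanish sentence /' > mono.es

# 4 shards of 25 lines each, named shard.00 .. shard.03
split -l 25 -d mono.es shard.

# Then run one process per shard, e.g. on GPU 0 (not executed here):
#   CUDA_VISIBLE_DEVICES=0 onmt-main --config es_en.yml --auto_config infer \
#       --features_file shard.00 --predictions_file synthetic.en.00

ls shard.*
```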

Nice. Any suggestions on the best parameter values to set? Below is my guess:

```yaml
params:
  freeze_layers:
    - "encoder/layers/0"
    - "decoder/output_layer"
  beam_width: 1
  decoding_noise:
    - dropout: 0.1
    - replacement: [0.1, ⦅unk⦆]
    - permutation: 3
  replace_unknown_target: false
```

Are there any other config parameters related to BT that I should consider?

Thanks !

There are not a lot of magic tricks, but beam_width: 1 is usually enough for back translation.

Not sure where you found that freeze_layers was related to BT. To get started, I’d say to only set beam_width: 1 for now.

:slight_smile: Thanks. I will run and let you know how it goes.