Expanding the model to bidirectional and/or multi-language support


I have come up with an English → Spanish (domain-adapted) model that performs well. This involved collecting parallel data, setting up a lot of pre-processing steps, developing the base model, and then fine-tuning on domain data.
I’m wondering how I can do the following with the OpenNMT-tf repo:

  1. Create a bidirectional model, one that performs both English → Spanish and Spanish → English. Is this even possible, or do I have to follow the same process twice and end up with two models, one for English → Spanish and another for Spanish → English?
  2. Scale the model to multiple languages, something like Google’s zero-shot translation model. I came across this repo while researching cross-lingual modeling. Is this something already available within OpenNMT, or on the roadmap for development?

Any help is appreciated.



Have you seen this one?

It’s a tutorial for OpenNMT-lua but the idea is generic.
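The core trick in that tutorial is to merge all language pairs into one corpus and mark each source sentence with the desired target language. A minimal sketch of that data preparation, assuming a `<2xx>` target-token format as in Google's zero-shot paper (the tutorial's exact token spelling may differ):

```python
# Sketch: build a multilingual training corpus by prepending a
# target-language token to every source sentence, so a single model can
# translate into several languages. Token format ('<2es>' etc.) and the
# (src, tgt, lang) layout are assumptions for illustration.

def tag_source_lines(lines, target_lang):
    """Prepend a target-language token, e.g. '<2es>', to each source line."""
    return [f"<2{target_lang}> {line}" for line in lines]

def build_multilingual_corpus(pairs):
    """pairs: list of (src_lines, tgt_lines, target_lang), one per direction.
    Including both en->es and es->en here yields one bidirectional model."""
    src_out, tgt_out = [], []
    for src_lines, tgt_lines, target_lang in pairs:
        src_out.extend(tag_source_lines(src_lines, target_lang))
        tgt_out.extend(tgt_lines)
    return src_out, tgt_out
```

So for question 1, feeding both directions into `build_multilingual_corpus` and training once gives a single model covering English → Spanish and Spanish → English.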

Thanks @guillaumekln. Let me take a look.

I’m guessing I can extend this to English and also to language families other than Romance. It seems to require parallel data for domain adaptation as well. (I might not have enough in-domain parallel data for other languages, hence I was also researching unsupervised techniques.)

If I had to replicate this tutorial using OpenNMT-tf, which model would I instantiate?

It looks like the Lua version uses a bidirectional RNN (`-brnn`) in this training step:

```
th train.lua -layers 4 -rnn_size 1000 -brnn -word_vec_size 600 \
    -data ${DATA}/esfritptro-multi-train.t7 \
    -save_model ${DATA}/onmt_esfritptro-4-1000-600 -gpuid 1
```

It’s still a translation task, so you can pick any sequence-to-sequence model you are used to, for example the Transformer model.
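In OpenNMT-tf this comes down to pointing a YAML data configuration at the merged multilingual files and training a Transformer. A sketch, where all file paths are assumptions; note the source and target vocabularies are shared across languages:

```yaml
# data.yml -- hypothetical paths; the vocabulary is built over all languages
model_dir: run/multi
data:
  train_features_file: data/train-multi.src.tok
  train_labels_file: data/train-multi.tgt.tok
  eval_features_file: data/dev-multi.src.tok
  eval_labels_file: data/dev-multi.tgt.tok
  source_vocabulary: data/multi.vocab
  target_vocabulary: data/multi.vocab
```

Training would then be started with `onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval`.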

Got it, thanks @guillaumekln!
I ran the Transformer model for training and I’m getting a pretty good eval BLEU score of around 32 (4,000 steps and counting). I don’t think they posted an eval BLEU to compare against, but I’m guessing that is roughly what they got.
To replicate the table results, my guess is to manually split the test-multi-src.tok and test-multi-tgt.tok files into 20 language pairs (40 files total) and run inference 20 times, once per language pair, on the split files, to avoid going through the pre-processing stages for the raw files again. Is that right?

Also, they mention that we can drop the source language token, for good reasons. I did not gather what code-switching means in this context:

> …, but the advantage is that it is simpler and we can handle input with code-switching.

Thanks!

You should split the file anyway to get a per-language-pair score.

As the source language is not forced, the source text can be composed of multiple languages and still (in theory) be translated into a single target language. That is what code-switching refers to here: mixed-language input.

Thanks @guillaumekln!
I was able to successfully replicate these results in OpenNMT-tf with the Transformer model after creating a vocabulary from the training files. It gave pretty good (mostly better) BLEU scores for all language pairs except *→RO.
I will try adding other language pairs next and check how well the model performs.