English to Russian translation with OpenNMT-py


I’ve run a series of experiments in English to Russian translation, but so far I haven’t been able to get above a BLEU score of 12.2, and I wonder what can be done to bring it up to 20-30.

Training and validation: Yandex-1M dataset
Testing: WMT 13-18 datasets.

List of experiments, ordered from the best to the worst BLEU score:

| # | Preprocessing | Model | BLEU score |
|---|---|---|---|
| 1 | no preprocessing | Transformer on 1 GPU | 8.2-12.2 |
| 2 | no preprocessing | deeper architecture (enc layers 3, enc rnn size 800, dec layers 2, dec rnn size 800) | 8.1-12.1 |
| 3 | no preprocessing | standard ONMT-py | 8-12 |
| 4 | Sentencepiece standard model | standard ONMT-py | 2.5-3.5 |

You need more data, Yandex is too small.

Thank you, Vincent. Roughly how much data would be sufficient? 5M? 10M?

With 5M you will start to get good results.



I’d like to provide an update on my experiments with larger datasets for the En-Ru pair.

Exp. 1: I took 7M sentences from Subtitles 18 and applied perl.tokenizer to the train, val and test data, along with all the standard OpenNMT-py preprocessing and training parameters. BLEU on WMT13-18 is 1.15-1.91.
Exp. 2: I took 6 large datasets (Wikipedia, TED, Globalvoices, Newcom11, MultiUN and Subtitles 18) and selected 5M sentences using Cynical Selection of language model training data (https://github.com/allo-media/cynical-selection). BLEU is 3.16-5.37.

What am I doing wrong? What do I need to do to bring BLEU up to at least 20?

Thank you,

read this


Hello! How is your research going?

As I’m working on this task too, I would like to share the results of my experiment. For the training set I joined 5 whole corpora (Yandex, TED, Wikipedia, Newscom11, EUbookshop) and also selected subsets from OpenSubtitles 2018 (3.5M) and from ParaCrawl (730,000). I used Cynical Selection (thank you for the tip). The overall size of the training set is 6.1M.
Then I transliterated the Russian corpus to learn a joint BPE, as described in Sennrich et al., 2016.
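For readers who have not seen the transliteration trick, here is a minimal sketch of the idea: map Cyrillic characters to Latin so that English and Russian share one alphabet while the BPE merges are learned. The table `CYR_TO_LAT` and the function `transliterate` are hypothetical names, and the mapping is a simplified illustration, not the scheme used in the paper or in my pipeline. It is also lossy (the hard and soft signs are dropped), so a real setup would need a reversible scheme.

```python
# Simplified, illustrative Cyrillic-to-Latin table (an assumption, not the
# transliteration standard used in the paper). Lossy: "ъ" and "ь" are dropped.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text):
    """Replace Cyrillic letters with Latin ones, keeping everything else."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)                # not in the table: leave untouched
        elif ch.isupper():
            out.append(lat.capitalize())  # preserve the initial capital
        else:
            out.append(lat)
    return "".join(out)


print(transliterate("Москва, перевод"))
```

The transliterated Russian side can then be concatenated with the English side to learn a single set of BPE merges.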

I trained the Transformer model with the recommended configuration for 28000 steps (approx. 2 epochs),
tested it on newstest2017 and got a BLEU score of 32.93. An ensemble with the 21000-step checkpoint gives a little more: 33.19.
NOTE: The BLEU scores above were calculated on the tokenized newstest. I came across sacrebleu later, and its score is 25.3 for the checkpoint from step 43500.
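To show why BLEU numbers computed under different tokenizations are not directly comparable, here is a toy sketch. It is a bare-bones corpus BLEU (single reference, no smoothing) written for illustration only, not the sacrebleu implementation. The function names and example sentences are made up, and `punct_split` is only a crude stand-in for a real tokenizer.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps, refs, tokenize, max_n=4):
    """Corpus-level BLEU (0-100), one reference per segment, no smoothing."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        h, r = tokenize(hyp), tokenize(ref)
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


def punct_split(text):
    """Crude tokenizer: words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


hyps = ["The cat sat on the mat, and the dog slept.",
        "He said: it is late, go home now."]
refs = ["The cat sat on the mat, but the dog slept.",
        "He said: it is very late, go home now."]

raw_score = corpus_bleu(hyps, refs, str.split)    # whitespace tokens only
tok_score = corpus_bleu(hyps, refs, punct_split)  # punctuation split off
print(f"whitespace: {raw_score:.1f}  punctuation-split: {tok_score:.1f}")
```

The hypotheses and references are identical in both calls; only the tokenization changes, and the score changes with it. That is the reason to report scores from one standard tool.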

I regret that I can’t try an averaged model: I set save_checkpoint_steps too high and got only 3 checkpoints.
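For context, checkpoint averaging just takes the element-wise mean of each parameter across several saved checkpoints. Below is a toy sketch that uses plain dicts of float lists as stand-ins for real model tensors. `average_checkpoints` and the sample values are hypothetical. In practice you would use the averaging tool that comes with your toolkit.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of every parameter across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats (a stand-in for a real tensor). All checkpoints must share the
    same parameter names and shapes.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Three made-up checkpoints, e.g. saved at different training steps.
ckpts = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
    {"w": [5.0, 6.0], "b": [2.0]},
]
avg = average_checkpoints(ckpts)
print(avg)
```

With only 3 checkpoints spaced far apart, the averaged weights may mix quite different stages of training, which is why a smaller save interval is usually preferred for this.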


I doubt 28000 steps = 18 epochs.
How many GPUs are you using?
You should get 35.5 or more on NT17.

Yes, you’re right. I confused tokens with sentences.
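A rough sketch of the arithmetic behind that mix-up. The batch size of 4096 tokens per step and the average of 10 tokens per sentence are assumptions chosen for illustration, not values taken from my configuration. Only the step count and the corpus size come from this thread.

```python
def epochs_seen(steps, tokens_per_step, corpus_sentences, avg_tokens_per_sentence):
    """Estimate how many passes over the corpus a token-batched run has made."""
    corpus_tokens = corpus_sentences * avg_tokens_per_sentence
    return steps * tokens_per_step / corpus_tokens


STEPS = 28_000
CORPUS_SENTENCES = 6_100_000
TOKENS_PER_STEP = 4096   # assumed token-based batch size
AVG_TOKENS = 10          # assumed average sentence length in tokens

# Mistake: treating the batch size as a number of sentences.
wrong = STEPS * TOKENS_PER_STEP / CORPUS_SENTENCES
# Corrected: convert the corpus size to tokens first.
right = epochs_seen(STEPS, TOKENS_PER_STEP, CORPUS_SENTENCES, AVG_TOKENS)

print(f"tokens counted as sentences: {wrong:.1f} epochs")
print(f"tokens counted as tokens:    {right:.1f} epochs")
```

Under these assumptions the first figure lands near 18 and the second near 2, which matches the correction above.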
I’m using one Tesla K80 (NC6 instance on MS Azure).
After my previous message I went further, to 43500 steps. The sacrebleu score is 25.3 for that checkpoint. So, as you say, I could get even more?
I’m on my free Azure trial, so I’m now thinking of spending the remaining money on an experiment with an RNN architecture; then I could try an ensemble of Transformer and RNN.