Transformer OpenNMT-py standard hyperparameters + pretrained model


I have some questions regarding the standard hyperparameters for training Transformer in OpenNMT-py and the pre-trained model from the website.

  1. I know the FAQ lists the hyperparameters. However, there is an inconsistency between the yellow block of hyperparameters and the “Here are what each of the parameters mean:” part just below it.
    The value of “-accum_count” differs between the two. Should it be 2 or 4, given the setup in the yellow block? If I understand correctly, this greatly affects the number of epochs that fit into the 200,000 training steps.

  2. The pre-trained Transformer model for WMT EN->DE is said to have been trained using those settings. I was able to reproduce the BLEU scores listed on the website with this model, but when I trained a new model using the same downloadable preprocessed data and training settings, my scores never got as high. I am trying to figure out why.

  • The accum count from Q1 may have affected the number of epochs: for how many epochs was the pre-trained model trained?
  • When downloading the model its name is “”. Was the model averaged over the last 80,000 steps? (because “-save_checkpoint_steps” is 10000 according to the website)
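
To make the epoch concern concrete, here is a rough sketch of how I think about it. The function name and the corpus size are hypothetical (I do not know the exact token count of the preprocessed data), and it assumes every batch is completely full, so it gives an upper estimate only:

```python
def approx_epochs(train_steps: int, tokens_per_step: int, corpus_tokens: int) -> float:
    """Rough number of passes over the corpus for a token-based batch setup.

    tokens_per_step is the number of tokens consumed per optimizer update,
    which scales linearly with accum_count.
    """
    return train_steps * tokens_per_step / corpus_tokens


if __name__ == "__main__":
    corpus_tokens = 100_000_000  # hypothetical corpus size, for illustration only
    base_tokens = 4096           # tokens in one batch before accumulation (assumed)
    for accum_count in (2, 4):
        epochs = approx_epochs(200_000, base_tokens * accum_count, corpus_tokens)
        print(f"accum_count={accum_count}: ~{epochs:.1f} epochs")
```

Whatever the real corpus size is, going from accum_count 2 to 4 doubles the number of epochs covered by the same 200,000 steps.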

It would be amazing if someone could help! :slight_smile:

cc @vince62s

What is your setup? How many GPUs, and what type?

When I did this run, the code base was different: it was epoch-based.

My suggestion now is to follow the original paper.

Use a batch size of about 25k tokens (you can actually go higher) for 100k steps; training a bit longer will give you better results.
So if you have 4 GPUs (11 GB ones), you can fit 4096 tokens per GPU.
In this config, if you set accum_count to 2, you get 4 x 4096 x 2 = 32K-token batches.
100k steps should be fine.
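
The batch arithmetic above can be sketched as follows. This is only an illustration: the function name is made up and is not part of OpenNMT-py, and it assumes token-based batching with every batch full.

```python
def effective_batch_tokens(num_gpus: int, tokens_per_gpu: int, accum_count: int) -> int:
    """Tokens that contribute to a single optimizer update.

    Each GPU processes accum_count batches of tokens_per_gpu tokens
    before the gradients are applied.
    """
    return num_gpus * tokens_per_gpu * accum_count


if __name__ == "__main__":
    # 4 GPUs x 4096 tokens x accum_count 2
    print(effective_batch_tokens(4, 4096, 2))  # 32768
    # Same hardware with accum_count 4
    print(effective_batch_tokens(4, 4096, 4))  # 65536
```

With fewer GPUs you can raise accum_count to reach the same effective batch, for example 2 GPUs with accum_count 4.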

Thanks a lot for your response! Yes, 4x 11gb, so your advice fits nicely.

However, could you please provide me with a bit more info about the pretrained model? So then was it trained according to the settings listed in the yellow block (accum_count = 2)? And for how many epochs was it trained, and over how many steps did you average?

Yes, it used these settings.

The average was over 10 checkpoints, one every 1000 steps, IIRC.
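
For anyone unfamiliar with checkpoint averaging, the idea is just an element-wise mean of the parameters across the saved checkpoints. Below is a minimal sketch using plain Python lists as a stand-in for real tensors; the `Checkpoint` type and function name are hypothetical, and this is not the script that was used for the pretrained model.

```python
from typing import Dict, List

# Hypothetical stand-in for a real model state dict: parameter name -> flat values.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Return the element-wise mean of every parameter across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged: Checkpoint = {}
    for name, first in checkpoints[0].items():
        sums = [0.0] * len(first)
        for ckpt in checkpoints:
            # Accumulate each value of this parameter from every checkpoint.
            for i, value in enumerate(ckpt[name]):
                sums[i] += value
        averaged[name] = [s / n for s in sums]
    return averaged


if __name__ == "__main__":
    ckpts = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
    print(average_checkpoints(ckpts))  # {'w': [2.0, 4.0]}
```

In practice you would pass the last N saved checkpoints (10 in the run described above) instead of the toy values here.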