Transformer OpenNMT-py standard hyperparameters + pretrained model

vdankers · February 9, 2019, 4:27pm

Hi,

I have some questions regarding the standard hyperparameters for training Transformer in OpenNMT-py and the pre-trained model from the website.

1. I know the FAQ lists the hyperparameters (http://opennmt.net/OpenNMT-py/FAQ.html). However, there is an inconsistency between the yellow block of hyperparameters and the “Here are what each of the parameters mean:” part just below it: the “-accum_count” value differs. Should it be 2 or 4, given the setup in the yellow block? If I understand correctly, this greatly affects the number of epochs that fit into the 200,000 training steps.
2. The pre-trained Transformer model for WMT EN->DE (http://opennmt.net/Models-py/) is said to be trained using those settings. I was able to reproduce the BLEU scores listed on the website using this model, but when I trained a new model with the same downloadable preprocessed data and training settings, my scores never got as high. I am trying to figure out why.

The accum count from Q1 may have affected the number of epochs: for how many epochs was the pre-trained model trained?
The downloaded model is named “averaged-10-epoch.pt”. Was the model averaged over the last 80,000 steps? (I ask because “-save_checkpoint_steps” is 10000 according to the website.)

It would be amazing if someone could help!

guillaumekln · February 11, 2019, 2:10pm

cc @vince62s

vince62s · February 11, 2019, 7:08pm

What is your setup? Number of GPUs, type of GPU?

When I did this run, the code base was different: it was epoch-based.

My suggestion now is to follow the original paper.

Use a batch size of about 25k tokens (you can actually go higher) and train for 100K steps; a few more steps will give you better results.
So if you have 4 GPUs (11GB ones), you can fit 4096 tokens per GPU.
In this config, if you set accum_count to 2, you get 4 GPUs x 4K tokens x 2 = 32K-token batches.
100K steps should be fine.
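
To make the arithmetic above explicit, here is a minimal Python sketch. It is not part of OpenNMT-py: the function names are made up for illustration, and it assumes every GPU batch is filled up to its token limit, so the result is an upper bound on the tokens per optimizer update. The corpus size passed to `approx_epochs` is something you would have to measure on your own data; the thread does not give it.

```python
def effective_batch_tokens(num_gpus: int, tokens_per_gpu: int, accum_count: int) -> int:
    """Upper bound on the tokens that contribute to one optimizer update.

    Assumes token-based batching where each GPU processes up to
    `tokens_per_gpu` tokens and gradients are accumulated over
    `accum_count` batches before the update.
    """
    return num_gpus * tokens_per_gpu * accum_count


def approx_epochs(train_steps: int, tokens_per_update: int, corpus_tokens: int) -> float:
    """Rough number of passes over the data (hypothetical helper).

    `corpus_tokens` is the token count of your own training set.
    """
    return train_steps * tokens_per_update / corpus_tokens


# Setup discussed in this thread: 4 GPUs, 4096 tokens each, accum_count 2.
tokens = effective_batch_tokens(num_gpus=4, tokens_per_gpu=4096, accum_count=2)
print(tokens)  # 32768, i.e. the "32K" mentioned above
```

Doubling accum_count to 4 with the same hardware doubles the tokens per update, which is why the 2-versus-4 question changes how many epochs fit into a fixed number of steps.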

vdankers · February 14, 2019, 12:05am

Thanks a lot for your response! Yes, 4x 11GB, so your advice fits nicely.

However, could you please give me a bit more information about the pretrained model? Was it trained according to the settings listed in the yellow block (accum_count = 2)? For how many epochs was it trained, and over how many steps did you average?

vince62s · February 14, 2019, 7:00pm

Yes, it was these settings.

The average was over 10 checkpoints, one every 1000 steps, IIRC.
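
For readers unfamiliar with checkpoint averaging, the idea is an element-wise mean of the parameters saved at several checkpoints. The sketch below is only an illustration in plain Python, not the script used for the pretrained model: the function name is hypothetical, and each checkpoint is assumed to be a dict mapping parameter names to flat lists of floats rather than real tensors.

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of parameter dicts that share the same keys.

    Each dict maps a parameter name to a flat list of floats
    (a stand-in for real model tensors, assumed for illustration).
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    n = len(state_dicts)
    averaged = {}
    for key in state_dicts[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(sd[key] for sd in state_dicts))
        averaged[key] = [sum(col) / n for col in columns]
    return averaged


# Two toy "checkpoints" with a single parameter each.
ckpt_a = {"w": [1.0, 2.0]}
ckpt_b = {"w": [3.0, 4.0]}
print(average_checkpoints([ckpt_a, ckpt_b]))  # {'w': [2.0, 3.0]}
```

With 10 checkpoints, the same function would simply receive a list of 10 dicts.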