I am using the following hyperparameters provided by OpenNMT for the transformer architecture. However, I am now trying to tune these parameters so that they are suitable for low-resource languages (approx. 400k sentences). Therefore, I am following this paper and trying to mimic their "supervised settings" hyperparameter tuning (page 6, section 4.3). So I am going to change heads to 2, and enc_layers and dec_layers to 5.
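Concretely, something like this, assuming the OpenNMT-py option names --heads, --enc_layers and --dec_layers (everything else stays at the defaults I am currently using):

```python
# Sketch of the overrides I have in mind, written as a dict of what I believe
# are the OpenNMT-py option names (--heads, --enc_layers, --dec_layers); the
# rest of my config stays at the transformer defaults mentioned above.
overrides = {
    "heads": 2,       # down from the transformer default of 8, per the paper's supervised settings
    "enc_layers": 5,  # down from the default 6
    "dec_layers": 5,  # down from the default 6
}

# Turn the overrides into command-line flags for onmt_train.
flags = " ".join(f"--{k} {v}" for k, v in overrides.items())
print(flags)  # --heads 2 --enc_layers 5 --dec_layers 5
```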
However, I have two concerns:

1 - They are using the following learning rate schedule:

"i.e., the learning rate increases linearly for 4,000 steps to 5e−4 (or 1e−3 in experiments that specify 2x lr)"
Isn't that the same learning-rate schedule as in the "Attention Is All You Need" transformer architecture, and hence the same as the OpenNMT transformer defaults?
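To make the comparison concrete, here is a small sketch of the two warm-up behaviours as I understand them. noam_lr is the schedule from "Attention Is All You Need" (factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5); the d_model, warmup and factor values below are just illustrative, not necessarily what my config uses). linear_warmup_lr is my reading of the quoted paper: a linear ramp to a fixed peak of 5e-4 over 4,000 steps. I am not sure what decay, if any, they apply after warm-up, so I only model the warm-up there.

```python
# Compare the "Attention Is All You Need" (Noam) schedule with my reading of the
# paper's linear warm-up to a fixed peak. Values like d_model=512, warmup=4000
# and factor=2.0 are illustrative assumptions, not taken from my actual config.

def noam_lr(step, d_model=512, warmup=4000, factor=2.0):
    # Noam schedule: factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def linear_warmup_lr(step, peak=5e-4, warmup=4000):
    # My reading of the paper: linear increase to `peak` over `warmup` steps
    # (I do not model whatever happens after warm-up).
    return peak * min(1.0, step / warmup)

for step in (100, 1000, 4000, 16000):
    print(step, round(noam_lr(step), 6), round(linear_warmup_lr(step), 6))
```

During warm-up both ramps are linear in the step count; what I cannot tell is whether the peak value and the behaviour after step 4,000 end up being the same, which is what my question is about.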
2 - They also state:

"We run experiments on between 4 and 8 Nvidia V100 GPUs with mini-batches of between 10K and 100K"
From my understanding, batch_size = acc_count * mini-batch, and since they set the mini-batch to 10K and my acc_count is 4, I should set batch_size to 40K. Is that right? But that is in the case where I have 1 GPU; since I have 4, should it instead be 10K?
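To make the question concrete, here is the arithmetic I am doing; the assumption about how batch_size, acc_count and the number of GPUs combine into one of the paper's mini-batches may be exactly where I am going wrong:

```python
# The arithmetic behind my question. The assumptions here (which may be exactly
# what I have wrong, and are what I am asking about): the paper's 10K mini-batch
# is measured in tokens, batch_size should equal acc_count * (paper mini-batch)
# on a single GPU, and with several GPUs that batch_size is divided by the
# number of GPUs.
paper_minibatch = 10_000  # the paper's smallest mini-batch (tokens, I assume)
acc_count = 4             # my gradient-accumulation count
num_gpus = 4              # V100s available to me

# My single-GPU reading: batch_size = acc_count * paper mini-batch -> 40K
single_gpu_batch_size = acc_count * paper_minibatch
print(single_gpu_batch_size)  # 40000

# With 4 GPUs, my guess is that each GPU only needs a quarter of that -> 10K
multi_gpu_batch_size = single_gpu_batch_size // num_gpus
print(multi_gpu_batch_size)  # 10000
```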