Tune transformer for low-resource languages

I am using the following hyperparameters provided by OpenNMT for the transformer architecture. However, I am now trying to tune these parameters so that they are suitable for low-resource languages (approx. 400k sentences). To do this, I am following this paper and trying to mimic their “supervised settings” hyperparameter tuning (page 6, section 4.3). So I am going to change `heads` to 2, and `enc_layers` and `dec_layers` to 5. However, I have two concerns:
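For concreteness, these are the only options I plan to override relative to that recipe (option names as in OpenNMT-py; the 8 heads and 6 layers I am moving away from are the standard transformer defaults):

```python
# The only OpenNMT-py options I plan to change for the low-resource setup,
# following the paper's "supervised settings"; everything else stays as in
# the standard recipe.
overrides = {
    "heads": 2,       # attention heads, down from 8
    "enc_layers": 5,  # encoder layers, down from 6
    "dec_layers": 5,  # decoder layers, down from 6
}
```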

  1. They are using the following learning-rate schedule:

> i.e., the learning rate increases linearly for 4,000 steps to 5e−4 (or 1e−3 in experiments that specify 2x lr)

Isn’t that the same schedule as in the “Attention Is All You Need” transformer architecture, and hence the same as the OpenNMT transformer?
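To check my understanding, here is a quick sketch comparing the two warmups. The Noam formula is from “Attention Is All You Need” (section 5.3) with the base-model defaults d_model = 512 and warmup = 4,000; for the paper I’m following I am only modelling the linear warmup to 5e−4 quoted above, not whatever decay they apply afterwards:

```python
# Compare the "Attention Is All You Need" (Noam) schedule with the warmup
# described in the quoted paper: linear increase to 5e-4 over 4,000 steps.
# d_model=512 and warmup=4000 are the transformer-base defaults; the decay
# the paper applies after warmup is not modelled here.

def noam_lr(step, d_model=512, warmup=4000):
    # Formula from "Attention Is All You Need", section 5.3
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def linear_warmup_lr(step, peak=5e-4, warmup=4000):
    # Warmup phase as described in the paper I'm following (no decay modelled)
    return peak * min(1.0, step / warmup)

for step in (100, 1000, 4000, 16000):
    print(f"step {step:>6}: "
          f"noam={noam_lr(step):.2e}  linear-to-5e-4={linear_warmup_lr(step):.2e}")
```

With d_model = 512 the Noam schedule peaks at roughly 7e−4 at step 4,000 rather than exactly 5e−4, which is part of why I’m unsure whether the two setups are really equivalent.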

  2. They say:

> We run experiments on between 4 and 8 Nvidia V100 GPUs with mini-batches of between 10K and 100K

From my understanding, batch_size = accum_count * mini-batch size,

and since they want to set the mini-batches to 10k, and my accum_count is 4, I should set batch_size to 40k. Is that right?

But that would be the case with 1 GPU; since I have 4 GPUs, should it be 10k instead?
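To make the arithmetic I’m doing explicit, here is a small sketch. The formula “tokens per optimizer update = batch_size * accum_count * num_gpus” is my assumption about how OpenNMT counts tokens, so please correct me if that is wrong:

```python
# My assumption (please correct me if this is not how OpenNMT counts tokens):
# tokens per optimizer update = batch_size * accum_count * num_gpus.

def tokens_per_update(batch_size, accum_count, num_gpus):
    return batch_size * accum_count * num_gpus

ACCUM_COUNT = 4
PAPER_MINI_BATCH = 10_000  # the paper's smallest mini-batch, in tokens

# Option A (my reading of my formula): batch_size = accum_count * 10k = 40k
batch_a = ACCUM_COUNT * PAPER_MINI_BATCH  # 40_000
print(tokens_per_update(batch_a, ACCUM_COUNT, num_gpus=1))  # 160000
print(tokens_per_update(batch_a, ACCUM_COUNT, num_gpus=4))  # 640000

# Option B (what I'm asking about): keep batch_size at 10k on my 4 GPUs
print(tokens_per_update(PAPER_MINI_BATCH, ACCUM_COUNT, num_gpus=4))  # 160000
```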