Big Transformer model parameters

Hello! I want to train a big Transformer model on the UN corpus (English to Chinese) on 4 GTX 1080 Ti GPUs. I followed the paper and adjusted some of the parameters, but I am not sure about parameters like max_generator_batches, warmup_steps, etc.

Can someone share their experience with training the big Transformer?

python train.py -data /path/to/data -save_model /path/to/models \
    -layers 6 -rnn_size 1024 -word_vec_size 1024 -transformer_ff 4096 -heads 16 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 3000000 -max_generator_batches 2 -dropout 0.3 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 4 -gpu_ranks 0 1 2 3

There is no perfect recipe. Your choices are fine; otherwise, you could try to replicate the settings from some of the papers.
max_generator_batches is an old trick; leaving it at 2 is fine.
warmup_steps: some people use 4000, others 16000, depending on accum_count and the number of GPUs.
One important parameter to tune is dropout. It is not easy to get right, and it depends a lot on the dataset.
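
To get a feel for how warmup_steps interacts with the other settings, here is a rough sketch of the Noam schedule as it is usually written (linear warmup, then inverse square-root decay), plus the number of tokens per optimizer update implied by the command above. The function names are made up for illustration, and two things are assumptions rather than confirmed toolkit behaviour: that the schedule is applied exactly in this form, and that each GPU processes batch_size tokens per step with gradients accumulated over accum_count steps.

```python
def noam_lr(step, model_dim=1024, warmup_steps=8000, base_lr=2.0):
    """Noam schedule: linear warmup up to warmup_steps, then step**-0.5 decay.

    Defaults mirror -rnn_size 1024, -warmup_steps 8000, -learning_rate 2.
    Assumption: this is the standard formula, not copied from the toolkit.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return base_lr * model_dim ** -0.5 * min(step ** -0.5,
                                             step * warmup_steps ** -1.5)


def tokens_per_update(batch_size=4096, accum_count=2, num_gpus=4):
    """Approximate tokens seen per optimizer update.

    Assumption: every GPU gets its own batch of batch_size tokens and
    gradients are accumulated over accum_count batches before each update.
    """
    return batch_size * accum_count * num_gpus


if __name__ == "__main__":
    print("tokens per update:", tokens_per_update())
    # The peak learning rate is reached at step == warmup_steps.
    for warmup in (4000, 8000, 16000):
        peak = noam_lr(warmup, warmup_steps=warmup)
        print(f"warmup_steps={warmup:>5}  peak lr={peak:.6f}")
```

A longer warmup lowers the peak learning rate and reaches it later, which is one reason the value people pick tends to move with accum_count and the number of GPUs.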