Hello! I want to train the big Transformer model on the UN corpus (English to Chinese) on 4 GTX 1080 Ti GPUs. I followed http://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model-do-you-support-multi-gpu and changed some parameters based on the paper. However, I am not sure about parameters such as max_generator_batches and warmup_steps.
Can someone share their experience of training the big Transformer? This is the command I am using:
```
python train.py -data /path/to/data -save_model /path/to/models \
    -layers 6 -rnn_size 1024 -word_vec_size 1024 -transformer_ff 4096 -heads 16 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 3000000 -max_generator_batches 2 -dropout 0.3 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 4 -gpu_ranks 0 1 2 3
```
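
For context, here is my back-of-the-envelope calculation of the effective batch size per optimizer update with these settings (assuming -batch_size counts tokens per GPU and that gradients are accumulated across -accum_count batches and -world_size GPUs, which is how I read the FAQ):

```python
# Rough tokens per parameter update under my reading of the options
# (assumption: per-GPU token batches are accumulated and summed across GPUs).
batch_size = 4096   # -batch_size (tokens)
accum_count = 2     # -accum_count
world_size = 4      # -world_size

tokens_per_update = batch_size * accum_count * world_size
print(tokens_per_update)  # 32768, in the same ballpark as the ~25k tokens/batch in the paper
```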
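On -warmup_steps and -learning_rate, my understanding is that under -decay_method noam the learning rate rises roughly linearly for warmup_steps and then decays with the inverse square root of the step, with -learning_rate acting as an overall scale. A minimal sketch of that schedule (based on the formula in the Transformer paper; the exact OpenNMT-py implementation may differ slightly, and model_dim here is assumed to be -rnn_size):

```python
# Sketch of the noam learning-rate schedule (Vaswani et al., 2017).
# Assumption: -rnn_size plays the role of d_model and -learning_rate is a scale factor.
def noam_lr(step, model_dim=1024, warmup_steps=8000, scale=2.0):
    step = max(step, 1)
    return scale * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1000, 8000, 100000):
    print(s, noam_lr(s))  # peaks around step 8000, then decays as 1/sqrt(step)
```

If this reading is right, raising warmup_steps mainly lowers and delays the peak learning rate, which is one of the things I would like someone to confirm from experience.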