OpenNMT Forum

Training gets stuck with GPU at 100% at step 257100 or 507100

Hi everyone, I have run into a very strange problem.
I am training on a large dataset with this command:

python train.py -data ez/ze/ze -save_model ez/model/ze-model \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 1500000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 3008 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 \
    -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -log_file ze3log.train -keep_checkpoint 200 -world_size 2 -gpu_ranks 0 1

The training gets stuck at step 257100. I then quit with Ctrl+C and resume with -train_from, and it gets stuck again at step 507100.
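
For reference, I resume with the same command as above, just with -train_from appended; the checkpoint file name below is only a placeholder for whichever checkpoint was saved last:

# resume run: identical options, plus -train_from pointing at the last saved checkpoint
# (ze-model_step_250000.pt is a placeholder name, not the actual file)
python train.py -data ez/ze/ze -save_model ez/model/ze-model \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 1500000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 3008 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 \
    -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -log_file ze3log.train -keep_checkpoint 200 -world_size 2 -gpu_ranks 0 1 \
    -train_from ez/model/ze-model_step_250000.pt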

When it is stuck, nvidia-smi shows 100% GPU-Util on both GPUs.

The log stops at this line:

[2019-06-11 22:16:00,046 INFO] Step 507100/1500000; acc: 64.91; ppl: 4.31; xent: 1.46; lr: 0.00012; 11713/13772 tok/s; 223196 sec

The problem reproduces every time at step 257100 or 507100.
I hope someone can help me. Thanks.

It's annoying, but across many runs we experienced something similar on a specific language pair without finding the reason or a solution.