This is probably an old problem. I did find issues opened on GitHub about it, and the proposed fix was adding a valid_batch_size option, defaulting to 32.
The strange part is that I can train just fine with a batch size of 4096, but during validation it throws CUDA OOM.
CUDA_VISIBLE_DEVICES=0,1 python3 train.py -data data/nmt_s1_s2_2018oct2 -save_model save/... \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -gpuid 0 1
My training data is relatively large, though: train.1.pt is 4.2G and valid.1.pt is 185M. My source sequences are capped at 400 words.
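Doing some rough arithmetic (just my own back-of-the-envelope guess, and it assumes the validation batch of 32 is counted in sentences rather than tokens), a single validation batch could end up larger than a training batch:

# training batch: -batch_size 4096 with -batch_type tokens => roughly 4096 tokens per batch
# validation batch, if 32 means 32 sentences and each sentence can be up to 400 words:
echo $((32 * 400))   # => 12800 tokens, about 3x my training batch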
I tried manually setting valid_batch_size (roughly what I ran is below), but it didn't help. Should I decrease my training batch size instead? My last resort is to cap the source sequence length at 300 words. Any suggestions?
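For reference, this is approximately what I tried; the value 8 is just an example of a small setting, not something taken from the docs:

# same options as the full command above, with an explicit validation batch size appended
CUDA_VISIBLE_DEVICES=0,1 python3 train.py -data data/nmt_s1_s2_2018oct2 ... \
    -valid_batch_size 8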