CUDA OOM for Validation


This is probably an old problem. I did find issues opened on GitHub about it, and the proposed fix was adding valid_batch_size, which defaults to 32.

The strange thing is that I can train just fine with batch size 4096, but validation throws a CUDA OOM error.

CUDA_VISIBLE_DEVICES=0,1 python3 -data data/nmt_s1_s2_2018oct2 -save_model save/... \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0  -param_init_glorot  \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -gpuid 0 1

My training data is relatively large though, the is 4.2G, and is 185M. The source sequence I have is capped at 400 words.

I tried to manually set valid_batch_size but it didn’t work. Should I decrease my training batch size instead? My last resort is to cap source sequence length at 300 words instead. Any suggestions?

(Vincent Nguyen) #2

You need to shard your dataset.
Currently the option is -max_shard_size,
but we will switch to shard_size for text (for now that one is only valid for image and audio).


Hi Vince, thank you for responding. I thought sharding (-max_shard_size) only made a difference for preprocessing, and only made the preprocessing go faster. Does it also make a difference for training and validation memory usage?

(Vincent Nguyen) #4

The whole point of sharding is to take less memory during training,
but mainly RAM.
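
To illustrate the general idea only (this is not OpenNMT's actual code, and `iter_shards` and its line-count `shard_size` argument are hypothetical names made up for this sketch): sharding means only one piece of the corpus is held in RAM at a time instead of the whole dataset.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_shards(lines: Iterable[str], shard_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `shard_size` lines.

    Hypothetical helper: only the current shard is materialized in memory.
    """
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    it = iter(lines)
    while True:
        shard = list(islice(it, shard_size))
        if not shard:
            return
        yield shard


if __name__ == "__main__":
    # A generator stands in for a large corpus file read line by line.
    corpus = (f"sentence {i}" for i in range(10))
    for shard in iter_shards(corpus, shard_size=4):
        print(len(shard))
```

This affects host RAM, not GPU memory, which is consistent with the OOM above having a different cause.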

Your case might be more related to sequence length, because I use the same settings and it works fine on a multi-GPU GTX 1080 Ti setup.

are you on master ?

(Vincent Nguyen) #5

in your valid set, do you have very long sequences ?


Possible! But I set -src_seq_length 400 when I run. Does this take care of validation sequences being too long?

I’ll probably use sharding! If that doesn’t work, I’ll fall back to a shorter maximum sequence length (i.e. 300 instead of 400).

(Vincent Nguyen) #7

Yes, but 400 is way too long for validation, because the validation batch size is sentence based, not token based.
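
A rough back-of-the-envelope sketch of that point, using the numbers from this thread. The helper `worst_case_tokens` is a made-up name, and the bound assumes every sentence in a validation batch is padded to the 400-word cap, which is a worst case rather than a measurement:

```python
def worst_case_tokens(num_sentences: int, max_len: int) -> int:
    """Upper bound on source tokens in a sentence-based batch after padding."""
    return num_sentences * max_len


TRAIN_BATCH_TOKENS = 4096  # -batch_size 4096 with -batch_type tokens
MAX_SRC_LEN = 400          # source length cap mentioned in the thread

if __name__ == "__main__":
    for valid_batch_size in (32, 16):
        tokens = worst_case_tokens(valid_batch_size, MAX_SRC_LEN)
        ratio = tokens / TRAIN_BATCH_TOKENS
        print(f"valid_batch_size={valid_batch_size}: up to {tokens} tokens "
              f"({ratio:.2f}x the training batch)")
```

So a validation batch of 32 sentences can hold up to 12,800 source tokens against 4,096 per training batch, and halving it to 16 brings the bound down to 6,400. Actual GPU memory depends on more than the token count, so treat this only as an indication of scale.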

Did you try reducing the validation batch size to something much lower?


Wow, valid batch size = 32 is already very small, but when I lowered it to 16, it ran even on my original dataset (max src seq length = 400, and training batch_size=4096). Thanks for the help!