CUDA OOM during validation

This is probably an old problem. I found issues opened on GitHub about it, and the proposed fix was to add valid_batch_size, which defaults to 32.

The strange thing is that I can train just fine with batch size 4096, but validation throws a CUDA OOM error.

CUDA_VISIBLE_DEVICES=0,1 python3 -data data/nmt_s1_s2_2018oct2 -save_model save/... \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0  -param_init_glorot  \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -gpuid 0 1

My training data is relatively large though, the is 4.2G, and is 185M. The source sequence I have is capped at 400 words.

I tried to manually set valid_batch_size but it didn’t work. Should I decrease my training batch size instead? My last resort is to cap source sequence length at 300 words instead. Any suggestions?

You need to shard your dataset.
Currently the option is -max_shard_size,
but we will switch to shard_size for text (currently that one is only valid for image and audio).

Hi Vince, thank you for responding. I thought sharding (-max_shard_size) only made a difference for preprocessing, and only made the preprocessing go faster. Does it also make a difference for training and validation memory usage?

The whole point of sharding is to take less memory during training,
but mainly RAM.
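
To illustrate why sharding mostly helps host RAM rather than GPU memory, here is a minimal sketch of the general idea. It is not OpenNMT's actual implementation; the helper name `iter_shards` and the byte limit are hypothetical. The corpus is consumed in bounded chunks, so only one shard has to be held in memory at a time:

```python
import io

def iter_shards(lines, max_shard_bytes):
    """Yield lists of lines whose total UTF-8 size stays within max_shard_bytes.

    Hypothetical helper for illustration only; a single line larger than the
    limit still gets a shard of its own.
    """
    shard, size = [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8"))
        # Close the current shard before it would exceed the limit.
        if shard and size + line_bytes > max_shard_bytes:
            yield shard
            shard, size = [], 0
        shard.append(line)
        size += line_bytes
    if shard:
        yield shard

# Tiny in-memory stand-in for a large corpus file (each line is 6 bytes).
corpus = io.StringIO("a b c\nd e f\ng h i\n")
shards = list(iter_shards(corpus, max_shard_bytes=12))
for i, shard in enumerate(shards):
    print(f"shard {i}: {len(shard)} lines")
```

The size of a GPU batch is unaffected by this kind of chunking, which is consistent with sharding not being the fix for an OOM that only appears during validation.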

Your case might be more related to sequence length, because I use the same settings and it works fine on a multi-GPU GTX 1080 Ti setup.

Are you on master?

Do you have very long sequences in your validation set?

Possibly! But when I run, I set -src_seq_length 400. Does this take care of validation sequences being too long?

I’ll probably use sharding! If that doesn’t work, I’ll fall back to a shorter maximum sequence length (i.e. 300 instead of 400).

Yes, but 400 is still way too long for validation, because the validation batch size is sentence-based, not token-based.
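
A rough back-of-the-envelope sketch of that difference, assuming validation sequences can reach the 400-word cap and that every sentence in a batch is padded to the longest one. The function name is hypothetical, and this counts source-side tokens only:

```python
def padded_tokens_sentence_batch(num_sentences, longest_len):
    """Worst-case padded token count when a batch holds a fixed number of sentences."""
    return num_sentences * longest_len

TRAIN_TOKEN_BUDGET = 4096  # -batch_size 4096 with -batch_type tokens
MAX_SRC_LEN = 400          # source length cap mentioned in the thread

for valid_batch_size in (32, 16):
    tokens = padded_tokens_sentence_batch(valid_batch_size, MAX_SRC_LEN)
    ratio = tokens / TRAIN_TOKEN_BUDGET
    print(f"valid_batch_size={valid_batch_size}: up to {tokens} source tokens "
          f"({ratio:.1f}x the training token budget)")
```

So a sentence-based batch of 32 can hold several times more tokens than a 4096-token training batch. Actual memory use also depends on target lengths and on validation typically not storing gradients, so this is only an order-of-magnitude comparison.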

Did you try reducing the validation batch size much further?

Wow, valid batch size = 32 is already very small, but when I lowered it to 16, it ran even on my original dataset (max src seq length = 400, and training batch_size=4096). Thanks for the help!

I have the same problem when I try to train the Transformer on CNN/DailyMail.