CUDA OOM for Validation

windweller1 · October 7, 2018, 7:34pm

This is probably an old problem. I did find issues opened on GitHub about it, and the proposed fix was to add valid_batch_size, which defaults to 32.

The strange thing is that I can train just fine with a batch size of 4096, but validation throws a CUDA OOM error.

CUDA_VISIBLE_DEVICES=0,1 python3 train.py -data data/nmt_s1_s2_2018oct2 -save_model save/... \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0  -param_init_glorot  \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -gpuid 0 1

My training data is relatively large, though: train.1.pt is 4.2G and valid.1.pt is 185M. My source sequences are capped at 400 words.

I tried to manually set valid_batch_size but it didn’t work. Should I decrease my training batch size instead? My last resort is to cap source sequence length at 300 words instead. Any suggestions?

vince62s · October 7, 2018, 7:37pm

You need to shard your data set.
Currently the option is -max_shard_size,
but we will switch to shard_size for text (currently that one is only valid for image and audio).

windweller1 · October 7, 2018, 7:48pm

Hi Vince, thank you for responding. I thought sharding (-max_shard_size) only made a difference for preprocessing (preprocess.py), by making it go faster. Does it also make a difference for training and validation memory usage?

vince62s · October 7, 2018, 8:10pm

The whole point of sharding is to take less memory during training,
but mainly RAM.
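
As a rough illustration of why sharding helps with RAM, here is a minimal, generic sketch of reading a corpus one shard at a time so that only one shard is held in memory. This is not OpenNMT-py's actual implementation; the helper name `iter_shards` and the toy corpus are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_shards(examples: Iterable[str], shard_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most shard_size examples.

    Only the current shard is materialized; the rest of the corpus stays
    in the underlying iterator (e.g. a file handle) until it is needed.
    """
    it = iter(examples)
    while True:
        shard = list(islice(it, shard_size))
        if not shard:
            return
        yield shard


# Toy corpus standing in for a large training file (hypothetical data).
corpus = (f"sentence {i}" for i in range(10))
sizes = [len(shard) for shard in iter_shards(corpus, shard_size=4)]
print(sizes)
```

Peak host memory then scales with the shard size rather than with the whole data set, which is separate from the GPU memory used by each batch.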

Your case might be more related to sequence length, because I use the same settings and it works fine on a multi-GPU GTX 1080 Ti setup.

Are you on master?

vince62s · October 7, 2018, 8:24pm

In your valid set, do you have very long sequences?

windweller1 · October 7, 2018, 8:36pm

Possible! But when I ran preprocess.py, I set -src_seq_length 400. Does this take care of validation sequences being too long?

I’ll probably use sharding! If that doesn’t work, I’ll fall back to a shorter maximum sequence length (i.e. 300 instead).

vince62s · October 7, 2018, 8:40pm

Yes, but 400 is still way too long for validation, because validation batches are sentence-based, not token-based.

Did you try reducing the validation batch size to something much lower?
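
To make the sentence-based versus token-based difference concrete, here is a back-of-the-envelope sketch using the numbers from this thread. It assumes every sentence in a batch is padded to the 400-word source cap and counts source tokens only, so it is a worst-case bound rather than a measurement of real GPU memory; `padded_tokens` is a hypothetical helper.

```python
def padded_tokens(num_sentences: int, max_len: int) -> int:
    """Upper bound on tokens in a batch padded to max_len per sentence."""
    return num_sentences * max_len


MAX_SRC_LEN = 400          # -src_seq_length used at preprocessing
TRAIN_TOKEN_BUDGET = 4096  # -batch_size 4096 with -batch_type tokens

# Training: the token budget caps the batch, so at maximum length
# only a handful of sentences fit into one batch.
train_sentences_at_max_len = TRAIN_TOKEN_BUDGET // MAX_SRC_LEN
print("training sentences per batch at max length:", train_sentences_at_max_len)

# Validation: the batch size counts sentences, so the worst case
# grows with the sentence length instead of being capped.
for valid_batch_size in (32, 16):
    worst = padded_tokens(valid_batch_size, MAX_SRC_LEN)
    ratio = worst / TRAIN_TOKEN_BUDGET
    print(f"valid_batch_size={valid_batch_size}: up to {worst} tokens "
          f"({ratio:.2f}x the training token budget)")
```

Under these assumptions, a validation batch of 32 sentences can hold roughly three times as many tokens as a training batch, which is consistent with validation running out of memory while training does not.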

windweller1 · October 8, 2018, 1:01am

Wow, valid batch size = 32 is already very small, but when I lowered it to 16, it ran even on my original dataset (max src seq length = 400, and training batch_size=4096). Thanks for the help!

zhaoguangxiang · December 17, 2018, 12:46pm

I have the same problem when I try to train the Transformer on CNN/DailyMail.