Errors while training with Transformer: CUDA OOM and removed batches

I am trying to train a system with a Transformer encoder and decoder:

onmt_train -world_size 1 -gpu_ranks 0 -data data/demo -save_model data/demo-model \
        -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
        -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2\
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
        -max_grad_norm 0 -param_init 0  -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 50000

And I have two questions about the training log:

[2019-12-02 10:00:12,587 INFO] At step 76891, we removed a batch - accum 1
[2019-12-02 10:00:21,219 INFO] Step 76900/200000; acc:  90.90; ppl:  1.56; xent: 0.44; lr: 0.00032; 7086/6441 tok/s;  74052 sec
Traceback (most recent call last):
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 370, in _gradient_accumulation
    trunc_size=trunc_size)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 165, in __call__
    loss, stats = self._compute_loss(batch, **shard)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 261, in _compute_loss
    loss = self.criterion(scores, gtruth)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 220, in forward
    return F.kl_div(output, model_prob, reduction='sum')
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/functional.py", line 1942, in kl_div
    reduced = torch.kl_div(input, target, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 11.91 GiB total capacity; 8.11 GiB already allocated; 513.06 MiB free; 2.68 GiB cached)
[2019-12-02 10:00:23,780 INFO] At step 76903, we removed a batch - accum 1
[2019-12-02 10:01:08,983 INFO] Step 76950/200000; acc:  90.68; ppl:  1.56; xent: 0.44; lr: 0.00032; 7150/6597 tok/s;  74100 sec
Traceback (most recent call last):
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 370, in _gradient_accumulation
    trunc_size=trunc_size)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 165, in __call__
    loss, stats = self._compute_loss(batch, **shard)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 261, in _compute_loss
    loss = self.criterion(scores, gtruth)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 220, in forward
    return F.kl_div(output, model_prob, reduction='sum')
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/functional.py", line 1942, in kl_div
    reduced = torch.kl_div(input, target, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 11.91 GiB total capacity; 10.10 GiB already allocated; 519.06 MiB free; 701.09 MiB cached)
[2019-12-02 10:01:41,664 INFO] At step 76985, we removed a batch - accum 0
  1. Removed batches: this is not an error, right? Sometimes the training process removes a batch, and if it happens rarely, that is OK?

  2. CUDA OOM. I can see that 95% of GPU memory is used during the training process, and training is the only process occupying the GPU.
    I tried to reduce “-valid_batch_size”, but it doesn’t help.
    Maybe the problem is my vocabulary:

    [2019-12-01 13:26:00,971 INFO] * src vocab size = 80739
    [2019-12-01 13:26:00,971 INFO] * tgt vocab size = 100004

It is pretty big, but I want to keep all the available words from the training set.
Main question: does this harm my system or just slow it down?

  1. Yes, we added the possibility to drop the occasional batch that would produce an OOM, in order to allow training with bigger batches overall.

  2. You may want to halve your batch_size and double your accum_count; this results in approximately the same “true” batch size. It’ll be slower, though (see the adjusted command sketched below).
    Also, if you want to reduce your vocab size while retaining coverage of your training data, you may want to have a look at subword methods: Using Sentencepiece/Byte Pair Encoding on Model
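
For illustration, here is the original command with the batch size halved and the accumulation count doubled (2048 × 4 keeps roughly the same effective ~8192-token batch as 4096 × 2; all other flags are unchanged, and these particular values are just one example pair):

onmt_train -world_size 1 -gpu_ranks 0 -data data/demo -save_model data/demo-model \
        -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
        -batch_size 2048 -batch_type tokens -normalization tokens  -accum_count 4 \
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
        -max_grad_norm 0 -param_init 0  -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 50000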

Ok, thanks. I’ll try it.
But for now, will this training process end successfully?
You know, I have already spent a lot of time and I want to get some results. Should I wait one more day (the estimated remaining time)?

Depends on what you mean by “end successfully”. At 76k steps I think you should already have an idea of whether your model is going to be usable or not. Have you performed some inference/evaluation on already saved checkpoints?

Yes, the saved checkpoint model works pretty well.
But I wasn’t sure if I can use it in my evaluation table (which I use to compare different experiments).
If I use different data and change the training options (batch_size and accum_count, so that I don’t get OOM errors) and get a better/worse BLEU, does that mean my new data set is better/worse, or not? Could these errors affect model performance and the final BLEU?
But OK, if I understand you correctly, this type of error is not critical for the model.

Hello All,
So I am trying to train a Transformer NMT model with a corpus size of ~5 million, batch size 4096 and accum_count 2, on a Tesla V100 (single-GPU machine, 16 GB). I am constantly getting CUDA OOM errors.
I am using SentencePiece BPE (vocab_size=32k), which gives me the final vocab sizes below:

Counters src:61583
[2021-02-08 09:48:27,023 INFO] Counters tgt:38519
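
For context, 32k BPE models like the en_32000/ta_32000 ones referenced in the config below are typically trained with SentencePiece along these lines (the input paths here are an assumption for the sketch, not taken from the thread):

spm_train --input=data/src_train.txt --model_prefix=en_32000 --vocab_size=32000 --model_type=bpe
spm_train --input=data/tgt_train.txt --model_prefix=ta_32000 --vocab_size=32000 --model_type=bpe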

I initially tried starting the training with 250k steps and then with 150k steps. In all cases it goes OOM. What could be the reason?

data:
    corpus:
        path_src: data/src_train.txt
        path_tgt: data/tgt_train.txt
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: data/src_dev.txt
        path_tgt: data/tgt_dev.txt
        transforms: [sentencepiece, filtertoolong]

src_subword_model: model/sentencepiece_models/en_32000.model
tgt_subword_model: model/sentencepiece_models/ta_32000.model

src_seq_length: 200
tgt_seq_length: 200

skip_empty_level: silent

save_model: model/model_
save_checkpoint_steps: 10000
train_steps: 150000
valid_steps: 10000
tensorboard: true
tensorboard_log_dir: runs/onmt

world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
max_generator_batches: 2
accum_count: [2]

normalization: "tokens"
optim: "adam"
learning_rate: 0.25
adam_beta2: 0.998
decay_method: "noam"
warmup_steps: 8000
max_grad_norm: 0
param_init: 0
param_init_glorot: true
label_smoothing: 0.1

encoder_type: transformer
decoder_type: transformer
layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout: [0.1]
position_encoding: true
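
For reference, a config like the one above would typically be saved to a YAML file and launched with the v2 command line roughly as follows (config.yaml is an assumed filename; the src_vocab/tgt_vocab paths that onmt_build_vocab writes and onmt_train reads are not shown in the config above):

onmt_build_vocab -config config.yaml -n_sample -1
onmt_train -config config.yaml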

I don’t really see how changing the number of steps would prevent any OOM.

Also, it might be that your src/tgt_seq_length is a bit long. You can try to reduce either this or the batch_size. Or, since you’re using a V100, you might want to train with mixed precision (model_dtype: fp16), which uses significantly less GPU memory.
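
As an illustration, the memory-related lines of the config above could be adjusted like this (the specific values 150 and 2048 are assumptions for the sketch, not recommendations from the thread):

src_seq_length: 150
tgt_seq_length: 150
batch_size: 2048

# or, keeping the current sizes, enable mixed precision on the V100
model_dtype: "fp16"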

Yes, ideally the number of training steps should not cause an OOM issue. It worked once when I changed it, hence I tried.

Yes, I am trying to reduce the src/tgt sequence lengths.