OpenNMT Forum

Errors while training with Transformer: CUDA OOM and removed batches

I am trying to train a system with a Transformer encoder and decoder:

onmt_train -world_size 1 -gpu_ranks 0 -data data/demo -save_model data/demo-model \
        -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
        -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2\
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
        -max_grad_norm 0 -param_init 0  -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 50000

I have two questions about the training log:

[2019-12-02 10:00:12,587 INFO] At step 76891, we removed a batch - accum 1
[2019-12-02 10:00:21,219 INFO] Step 76900/200000; acc:  90.90; ppl:  1.56; xent: 0.44; lr: 0.00032; 7086/6441 tok/s;  74052 sec
Traceback (most recent call last):
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 370, in _gradient_accumulation
    trunc_size=trunc_size)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 165, in __call__
    loss, stats = self._compute_loss(batch, **shard)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 261, in _compute_loss
    loss = self.criterion(scores, gtruth)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 220, in forward
    return F.kl_div(output, model_prob, reduction='sum')
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/functional.py", line 1942, in kl_div
    reduced = torch.kl_div(input, target, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 11.91 GiB total capacity; 8.11 GiB already allocated; 513.06 MiB free; 2.68 GiB cached)
[2019-12-02 10:00:23,780 INFO] At step 76903, we removed a batch - accum 1
[2019-12-02 10:01:08,983 INFO] Step 76950/200000; acc:  90.68; ppl:  1.56; xent: 0.44; lr: 0.00032; 7150/6597 tok/s;  74100 sec
Traceback (most recent call last):
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 370, in _gradient_accumulation
    trunc_size=trunc_size)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 165, in __call__
    loss, stats = self._compute_loss(batch, **shard)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 261, in _compute_loss
    loss = self.criterion(scores, gtruth)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 220, in forward
    return F.kl_div(output, model_prob, reduction='sum')
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/functional.py", line 1942, in kl_div
    reduced = torch.kl_div(input, target, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 11.91 GiB total capacity; 10.10 GiB already allocated; 519.06 MiB free; 701.09 MiB cached)
[2019-12-02 10:01:41,664 INFO] At step 76985, we removed a batch - accum 0
  1. "Removed a batch" - this is not an error, right? Sometimes the trainer removes batches during gradient descent, and if it happens rarely, is that okay?

  2. CUDA OOM. I can see that 95% of GPU memory is used during training, and the training process is the only one occupying the GPU.
    I tried reducing “-valid_batch_size”, but it doesn’t help.
    Maybe the problem is my vocabulary:

    [2019-12-01 13:26:00,971 INFO] * src vocab size = 80739
    [2019-12-01 13:26:00,971 INFO] * tgt vocab size = 100004

It is pretty big, but I want to keep all the available words from the training set.
Main question: does this harm my system, or does it just slow it down?

  1. Yes, we added the possibility to drop the occasional batch that would produce an OOM, in order to allow training with bigger batches overall.

  2. You may want to halve your batch_size and double your accum_count; this results in approximately the same “true” batch size, though training will be slower. (See the adjusted command sketched after this list.)
    Also, if you want to reduce your vocab size while retaining coverage of your training data, you may want to have a look at subword methods: Using Sentencepiece/Byte Pair Encoding on Model. (A small example follows below.)
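
For reference, here is a sketch of the adjusted command, with the batch size halved and the accumulation count doubled and everything else kept as in your original invocation (the effective update size stays at roughly 2048 × 4 = 8192 tokens per update):

onmt_train -world_size 1 -gpu_ranks 0 -data data/demo -save_model data/demo-model \
        -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -train_steps 200000  -max_generator_batches 2 -dropout 0.1 \
        -batch_size 2048 -batch_type tokens -normalization tokens  -accum_count 4 \
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
        -max_grad_norm 0 -param_init 0  -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 50000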
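
If you go the subword route, a minimal sketch with the SentencePiece command-line tools might look like the following; the file names (train.src, train.sp.src), the model prefix spm_src, and the vocab size 32000 are placeholders for illustration, not values from your setup:

# Train a BPE subword model on the raw source-side training data (placeholder paths).
spm_train --input=train.src --model_prefix=spm_src --vocab_size=32000 --model_type=bpe --character_coverage=1.0
# Apply the trained model to produce the subword-encoded file used for preprocessing/training.
spm_encode --model=spm_src.model < train.src > train.sp.src

The same steps would be repeated for the target side; training on the encoded files keeps both vocabularies bounded while still covering the full training data.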

Ok, thanks. I’ll try it.
But for now: will this training run finish successfully?
I have already spent a lot of time on it and I want to get some result. Should I wait one more day (the estimated remaining time)?

It depends on what you mean by “end successfully”. At 76k steps you should already have an idea of whether your model is going to be usable or not. Have you performed any inference/evaluation on the checkpoints saved so far?

Yes, the saved checkpoint model works pretty well.
But I wasn’t sure whether I could use it in my evaluation table (which I use to compare different experiments).
If I use other data, change the training options (batch_size and accum_count, so I don’t get OOM errors), and get a better/worse BLEU, does that mean my new data set is better/worse? Could these errors affect model performance and the final BLEU?
But ok, if I understand you correctly, these kinds of errors are not critical for the model.