I am trying to train a system with a Transformer encoder and decoder:
onmt_train -world_size 1 -gpu_ranks 0 -data data/demo -save_model data/demo-model \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
-batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 50000
And I have two questions about the training log:
[2019-12-02 10:00:12,587 INFO] At step 76891, we removed a batch - accum 1
[2019-12-02 10:00:21,219 INFO] Step 76900/200000; acc: 90.90; ppl: 1.56; xent: 0.44; lr: 0.00032; 7086/6441 tok/s; 74052 sec
Traceback (most recent call last):
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 370, in _gradient_accumulation
trunc_size=trunc_size)
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 165, in __call__
loss, stats = self._compute_loss(batch, **shard)
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 261, in _compute_loss
loss = self.criterion(scores, gtruth)
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 220, in forward
return F.kl_div(output, model_prob, reduction='sum')
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/functional.py", line 1942, in kl_div
reduced = torch.kl_div(input, target, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 626.00 MiB (GPU 0; 11.91 GiB total capacity; 8.11 GiB already allocated; 513.06 MiB free; 2.68 GiB cached)
[2019-12-02 10:00:23,780 INFO] At step 76903, we removed a batch - accum 1
[2019-12-02 10:01:08,983 INFO] Step 76950/200000; acc: 90.68; ppl: 1.56; xent: 0.44; lr: 0.00032; 7150/6597 tok/s; 74100 sec
Traceback (most recent call last):
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 370, in _gradient_accumulation
trunc_size=trunc_size)
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 165, in __call__
loss, stats = self._compute_loss(batch, **shard)
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 261, in _compute_loss
loss = self.criterion(scores, gtruth)
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
result = self.forward(*input, **kwargs)
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/utils/loss.py", line 220, in forward
return F.kl_div(output, model_prob, reduction='sum')
File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/functional.py", line 1942, in kl_div
reduced = torch.kl_div(input, target, reduction_enum)
RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 0; 11.91 GiB total capacity; 10.10 GiB already allocated; 519.06 MiB free; 701.09 MiB cached)
[2019-12-02 10:01:41,664 INFO] At step 76985, we removed a batch - accum 0
-
"Removed batch": this is not an error, right? As I understand it, the trainer sometimes removes a batch during gradient accumulation, and if that happens rarely it is fine?
-
CUDA OOM: I can see that about 95% of the GPU memory is in use during training, and the training process is the only process on the GPU.
I tried reducing -valid_batch_size, but it did not help.
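The next thing I plan to try (not sure whether it is the right fix) is halving -batch_size and doubling -accum_count, so the effective number of tokens per update stays the same (2048 x 4 = 4096 x 2), keeping everything else in the command above unchanged:

-batch_size 2048 -batch_type tokens -normalization tokens -accum_count 4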
Maybe the problem is my vocabulary:
[2019-12-01 13:26:00,971 INFO]  * src vocab size = 80739
[2019-12-01 13:26:00,971 INFO]  * tgt vocab size = 100004
It is pretty big, but I want to keep all the words that occur in the training set.
Main question: does the big vocabulary actually harm the system (could it be behind the OOM errors?), or does it just slow training down?
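If I understand the traceback correctly, the failing kl_div call operates on float32 tensors of shape roughly (tokens in shard) x (tgt vocab), so as a rough upper bound (ignoring the sharding done by -max_generator_batches, which should make the real allocations smaller) one such tensor for a full 4096-token batch would be:

python -c "print(4096 * 100004 * 4 / 2**30, 'GiB')"
# about 1.5 GiB, the same order of magnitude as the 626 MiB / 1.02 GiB allocations in the tracebacks above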