GPU Out of memory

dhar · November 24, 2020, 2:41pm

Hello there,

I am training on 70k parallel sentences, but training stops with an error. The error message is given below:
RuntimeError: CUDA out of memory.

Note: I have a 4 GB GPU.

francoishernandez · November 24, 2020, 3:03pm

Please post your full configuration as well as the full error trace.

dhar · November 24, 2020, 3:53pm

RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 3.94 GiB total capacity; 2.82 GiB already allocated; 5.88 MiB free; 3.03 GiB reserved in total by PyTorch)

GPU: GeForce GTX 1050 Ti
Processor: Intel Core i5-7400
RAM: 8 GB

francoishernandez · November 24, 2020, 3:59pm

I mean the configuration that you’re trying to run: either your full command line or the full .yml file.
Also, that’s not the full error. There should be several lines of traceback above this one.

dhar · November 24, 2020, 4:08pm

Here is the error:

Traceback (most recent call last):
  File "/usr/local/bin/onmt_train", line 11, in <module>
    load_entry_point('OpenNMT-py==2.0.0rc2', 'console_scripts', 'onmt_train')()
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/bin/train.py", line 169, in main
    train(opt)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/bin/train.py", line 154, in train
    train_process(opt, device_id=0)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/train_single.py", line 102, in main
    trainer.train(
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/trainer.py", line 242, in train
    self._gradient_accumulation(
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/trainer.py", line 366, in _gradient_accumulation
    outputs, attns = self.model(
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/models/model.py", line 49, in forward
    dec_out, attns = self.decoder(dec_in, memory_bank,
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/decoders/transformer.py", line 314, in forward
    output, attn, attn_align = layer(
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/decoders/transformer.py", line 93, in forward
    output, attns = self._forward(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/decoders/transformer.py", line 169, in _forward
    output = self.feed_forward(self.drop(mid) + query)
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/modules/position_ffn.py", line 35, in forward
    inter = self.dropout_1(self.relu(self.w_1(self.layer_norm(x))))
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/nn/functional.py", line 1692, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 3.94 GiB total capacity; 2.82 GiB already allocated; 28.06 MiB free; 3.03 GiB reserved in total by PyTorch)

Here is the .yml file:

## Where the samples will be written
save_data: run/example
## Where the vocab(s) will be written
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False
data:
    corpus_1:
        path_src: bn.txt
        path_tgt: en.txt
    valid:
        path_src: bndev.txt
        path_tgt: endev.txt
save_model: run/model.bn-en
save_checkpoint_steps: 10000
keep_checkpoint: 10
seed: 3435
train_steps: 500000
valid_steps: 10000
warmup_steps: 8000
report_every: 100

decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
rnn_size: 512
layers: 6
transformer_ff: 2048
heads: 8

accum_count: 8
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0

batch_size: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'

world_size: 1
gpu_ranks:
- 0

francoishernandez · November 24, 2020, 5:29pm

Reduce your batch_size to 2048 tokens for instance. You can also reduce the model dimensions since your dataset is quite small:

- reduce layers
- reduce word_vec_size and rnn_size (should be equal)
- if you reduce the dimensions, you may also reduce the number of heads
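
As a rough illustration of why these changes help, the sketch below estimates how much GPU memory the weights, gradients, and Adam state take for the posted configuration versus a smaller one. It is a back-of-the-envelope calculation, not OpenNMT-py code: it ignores biases, layer norms, and all activations, assumes fp32 everywhere and an untied output projection, and takes the vocabulary sizes (50002 and 50004) from the training log posted later in this thread. The "smaller" settings are hypothetical values chosen only for the comparison.

```python
from dataclasses import dataclass

BYTES_PER_FLOAT32 = 4
# Assumption: weights + gradients + Adam first and second moments, all in fp32.
COPIES_PER_PARAM = 4


@dataclass
class TransformerConfig:
    layers: int
    model_dim: int           # word_vec_size == rnn_size in the .yml
    transformer_ff: int
    src_vocab: int = 50002   # from the "src vocab size" log line
    tgt_vocab: int = 50004   # from the "tgt vocab size" log line


def approx_param_count(cfg):
    """Approximate weight count, ignoring biases and layer norms."""
    d, ff = cfg.model_dim, cfg.transformer_ff
    attention = 4 * d * d                 # Q, K, V and output projections
    feed_forward = 2 * d * ff             # two linear layers
    encoder = cfg.layers * (attention + feed_forward)
    decoder = cfg.layers * (2 * attention + feed_forward)  # self + cross attention
    embeddings = (cfg.src_vocab + cfg.tgt_vocab) * d
    generator = cfg.tgt_vocab * d         # output projection, assumed untied
    return encoder + decoder + embeddings + generator


def approx_training_gib(cfg):
    """GiB needed for weights, gradients and Adam state (no activations)."""
    return approx_param_count(cfg) * COPIES_PER_PARAM * BYTES_PER_FLOAT32 / 2**30


def tokens_per_update(batch_size, accum_count):
    """Tokens contributing to one optimizer step with gradient accumulation."""
    return batch_size * accum_count


if __name__ == "__main__":
    original = TransformerConfig(layers=6, model_dim=512, transformer_ff=2048)
    # Hypothetical reduced model, for comparison only.
    smaller = TransformerConfig(layers=3, model_dim=256, transformer_ff=1024)
    for name, cfg in (("original", original), ("smaller", smaller)):
        print(f"{name}: {approx_param_count(cfg):,} weights, "
              f"~{approx_training_gib(cfg):.2f} GiB before activations")
    print("tokens per update:", tokens_per_update(4096, 8))
```

Under these assumptions the posted configuration comes to about 121M weights, or roughly 1.8 GiB before any activations are stored, which leaves little room on a 4 GB card. Activation memory grows with the number of tokens per batch, so lowering batch_size cuts it directly. If the number of tokens per update matters, one option (not suggested in the thread) is to raise accum_count in step: 2048 tokens with accum_count 16 gives the same 32768 tokens per update as 4096 with accum_count 8.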

dhar · November 25, 2020, 4:25am

A strange error occurred!

Following your reply, I edited my .yml file and training started. I set it to run for 2000 steps. After about 1600 steps, a power cut (load shedding) occurred.

I restarted training when the power came back, but this time I get the following error:

onmt_train -config bnen.yaml
[2020-11-25 10:20:18,107 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2020-11-25 10:20:18,107 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2020-11-25 10:20:18,107 INFO] Missing transforms field for valid data, set to default: [].
[2020-11-25 10:20:18,108 INFO] Parsed 2 corpora from -data.
[2020-11-25 10:20:18,108 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2020-11-25 10:20:18,108 INFO] Loading vocab from text file…
[2020-11-25 10:20:18,108 INFO] Loading src vocabulary from run/example.vocab.src
[2020-11-25 10:20:18,297 INFO] Loaded src vocab has 78558 tokens.
[2020-11-25 10:20:18,320 INFO] Loading tgt vocabulary from run/example.vocab.tgt
[2020-11-25 10:20:18,398 INFO] Loaded tgt vocab has 62502 tokens.
[2020-11-25 10:20:18,416 INFO] Building fields with vocab in counters…
[2020-11-25 10:20:18,519 INFO] * tgt vocab size: 50004.
[2020-11-25 10:20:18,605 INFO] * src vocab size: 50002.
[2020-11-25 10:20:18,609 INFO] * src vocab size = 50002
[2020-11-25 10:20:18,609 INFO] * tgt vocab size = 50004
Traceback (most recent call last):
  File "/usr/local/bin/onmt_train", line 11, in <module>
    load_entry_point('OpenNMT-py==2.0.0rc2', 'console_scripts', 'onmt_train')()
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/bin/train.py", line 169, in main
    train(opt)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/bin/train.py", line 154, in train
    train_process(opt, device_id=0)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/train_single.py", line 54, in main
    configure_process(opt, device_id)
  File "/usr/local/lib/python3.8/dist-packages/OpenNMT_py-2.0.0rc2-py3.8.egg/onmt/train_single.py", line 19, in configure_process
    torch.cuda.set_device(device_id)
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/cuda/__init__.py", line 263, in set_device
    torch._C._cuda_setDevice(device)
  File "/usr/local/lib/python3.8/dist-packages/torch-1.7.0-py3.8-linux-x86_64.egg/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

I have included everything, from the command to the error message.

Thanks in advance!

francoishernandez · November 25, 2020, 8:13am

You can export CUDA_VISIBLE_DEVICES=0, for instance, to explicitly make your GPU visible. It is strange, though, that it worked the first time and not now. Did you change anything in your setup?

A simple check to see if torch is properly installed with cuda is to execute the following in a python shell:

import torch
torch.cuda.is_available()

Also, what does nvidia-smi show?
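
The checks above can be gathered into one small script. This is a minimal sketch rather than an official diagnostic: the function name `cuda_report` and the report keys are made up for illustration, and it only uses standard-library calls plus `torch.cuda.is_available()`, `torch.cuda.device_count()`, and `torch.version.cuda`.

```python
import os
import shutil


def cuda_report():
    """Collect basic facts about GPU visibility for the current process."""
    report = {
        # None means the variable is unset, so all GPUs should be visible.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # nvidia-smi ships with the NVIDIA driver; missing usually means no driver on PATH.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # None here means a CPU-only build of PyTorch.
    report["torch_built_with_cuda"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False while `nvidia_smi_found` is True, the driver is probably present but PyTorch cannot use it, for example because of a CPU-only build or a driver that needs to be reloaded after a reboot. If `nvidia_smi_found` is False, the driver installation itself is the first thing to check.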

dhar · November 25, 2020, 3:53pm

Thank you! I have resolved this problem by installing CUDA and cuDNN.

I have another question about test set evaluation. Should I ask it here or create a new post?

francoishernandez · November 25, 2020, 5:24pm

> Thank you! I have resolved this problem by installing CUDA and cuDNN.

I’m not sure I understand how you could have hit a CUDA out-of-memory error in the first place if CUDA was not already installed. Anyway, if you can train now, all is good!

A new post is a good idea since it’s not related.

dhar · November 26, 2020, 1:44am

I don’t know either! It was running fine, then the error appeared, and now it is running well again!