OpenNMT Forum

CUDA out of memory when training the model

opennmt-py
#1

Hi all,

I’ve just started with OpenNMT-py and I’m trying to train the demo model following the quickstart instructions here (http://opennmt.net/OpenNMT-py/quickstart.html).

By running

python train.py -data data/demo -save_model demo-model

training runs on the CPU.

By running

python3 train.py -data data/demo -save_model demo-model -gpu_ranks 0

the GPU is used, but I get this error:

RuntimeError: CUDA out of memory. Tried to allocate 279.88 MiB (GPU 0; 1.95 GiB total capacity; 736.80 MiB already allocated; 105.88 MiB free; 23.08 MiB cached)

Specs:
Ubuntu 18.04 + opennmt-py (0.8.2)
NVIDIA Corporation GM107GLM [Quadro M1000M] [10de:13b1] (rev a2) (prog-if 00 [VGA controller])
Subsystem: Hewlett-Packard Company GM107GLM [Quadro M1000M] [103c:810a]
Flags: bus master, fast devsel, latency 0, IRQ 149
Memory at d3000000 (32-bit, non-prefetchable) [size=16M]
Memory at 90000000 (64-bit, prefetchable) [size=256M]
Memory at a0000000 (64-bit, prefetchable) [size=32M]
I/O ports at 3000 [size=128]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting

Can someone advise how to resolve this issue?

Here is the entire log:
[2019-03-26 14:55:47,213 INFO] * src vocab size = 24997
[2019-03-26 14:55:47,213 INFO] * tgt vocab size = 35820
[2019-03-26 14:55:47,213 INFO] Building model…
[2019-03-26 14:55:50,304 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24997, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(35820, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=35820, bias=True)
    (1): Cast()
    (2): LogSoftmax()
  )
)
[2019-03-26 14:55:50,304 INFO] encoder: 16506500
[2019-03-26 14:55:50,304 INFO] decoder: 41613820
[2019-03-26 14:55:50,304 INFO] * number of parameters: 58120320
[2019-03-26 14:55:50,306 INFO] Starting training on GPU: [0]
[2019-03-26 14:55:50,306 INFO] Start training loop and validate every 10000 steps…
[2019-03-26 14:55:50,378 INFO] Loading dataset from data/demo.train.0.pt, number of examples: 10000
Traceback (most recent call last):
  File "train.py", line 109, in <module>
    main(opt)
  File "train.py", line 39, in main
    single_main(opt, 0)
  File "/home/sebastjan/projekti/opennmt/OpenNMT-py/onmt/train_single.py", line 116, in main
    valid_steps=opt.valid_steps)
  File "/home/sebastjan/projekti/opennmt/OpenNMT-py/onmt/trainer.py", line 209, in train
    report_stats)
  File "/home/sebastjan/projekti/opennmt/OpenNMT-py/onmt/trainer.py", line 329, in _gradient_accumulation
    trunc_size=trunc_size)
  File "/home/sebastjan/projekti/opennmt/OpenNMT-py/onmt/utils/loss.py", line 159, in __call__
    loss.div(float(normalization)).backward()
  File "/home/sebastjan/.local/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/sebastjan/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 113.75 MiB (GPU 0; 1.95 GiB total capacity; 625.72 MiB already allocated; 57.00 MiB free; 23.28 MiB cached)

EDIT: Both verification tests from https://pytorch.org/get-started/locally/#linux-verification also pass.
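For anyone hitting the same wall, a rough back-of-the-envelope check from the parameter count in the log above (58,120,320 fp32 parameters) shows why a ~2 GiB card is tight, even before counting activations, optimizer state, and the CUDA context itself:

```shell
# 58,120,320 fp32 parameters × 4 bytes each, ×2 for weights + gradients
echo "$(( 58120320 * 4 * 2 / 1024 / 1024 )) MiB"   # → 443 MiB
```

So nearly a quarter of the card is gone on weights and gradients alone; the rest of the usage reported in the error comes mostly from activations saved for the backward pass.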

Thank you.
Sebastjan


(Guillaume Klein) #2

Hi,

2GB of GPU memory is usually not enough, even for the quickstart training. If you still want to proceed, you should reduce some options like the batch size and/or the model size.
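For example, something along these lines (flag names as in OpenNMT-py 0.8.x; the exact values are only a starting point to experiment with):

```shell
python3 train.py -data data/demo -save_model demo-model -gpu_ranks 0 \
    -batch_size 16 \
    -rnn_size 256 \
    -word_vec_size 256
```

Lowering `-batch_size` reduces activation memory per step, while smaller `-rnn_size` and `-word_vec_size` shrink the model itself.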


#3

Thank you, we have now moved to a machine with more resources and it works fine 🙂

Best
Sebastjan
