CUDA Out of Memory

chopinml · March 24, 2021, 3:36am

Hello Everyone,

I know that a laptop GPU with 2GB of RAM is not suitable for this purpose, but I have a question about two settings from the quick start: -n_sample and batch_size.

When I leave -n_sample at 10000, training does not start at all because of a CUDA out-of-memory error. I then tried a very small batch_size, such as 4, and it worked.

When I leave batch_size untouched in the YAML config but decrease -n_sample to 1000, it also works. But does it mean the system only knows the words from first 1000 sentences and makes up a translation from these words only?

My last question is about real-world usage: should a source/target corpus pair contain at least 100M sentences? Is this the minimum for a reasonable translation? What are the minimum system requirements for such a corpus, and what train_steps and training time should I expect?

I need to propose a rough budget, so recommendations of the form "with graphics card x and y million parallel sentences, training will take z weeks/months" would be very helpful in my case.

Thank you in advance.

francoishernandez · March 24, 2021, 10:53am

But does it mean the system only knows the words from first 1000 sentences and makes up a translation from these words only?

Yes, and the reason it works is probably that these first 1000 sentences contain no sentence long enough to overflow your memory.
You might want to add the filtertoolong transform; see: Always getting 'CUDA Out of Memory' Error · Issue #1907 · OpenNMT/OpenNMT-py · GitHub
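
As a rough sketch of what adding the transform could look like in the YAML config: the corpus paths and the limit of 100 tokens below are placeholder assumptions for illustration, not recommended values. src_seq_length and tgt_seq_length are, as far as I know, the options that filtertoolong reads for its length limits.

```yaml
# Assumed length limits for filtertoolong; tune them to what fits in your GPU memory.
src_seq_length: 100   # skip examples whose source side is longer than this many tokens
tgt_seq_length: 100   # skip examples whose target side is longer than this many tokens

data:
    corpus_1:
        path_src: path/to/src-train.txt   # hypothetical path
        path_tgt: path/to/tgt-train.txt   # hypothetical path
        transforms: [filtertoolong]       # apply the length filter to this corpus
```

The idea is that the longest sentence pairs, which are the ones most likely to trigger the memory overflow, never reach the GPU.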

You might also want to use batch_type: tokens, which will be more stable with regard to VRAM usage.

100M sentences is quite big. Anything from a few million sentences upward starts to be reasonable. (Of course, this depends on the task and the final goal.)

For training time estimates, you can check our description paper, which includes a quite reasonable example.

chopinml · March 24, 2021, 11:22pm

Thank you, I’ve added the transforms and changed batch_type to tokens:

save_data: toy-en-de/run/example
src_vocab: toy-en-de/run/example.vocab.src
tgt_vocab: toy-en-de/run/example.vocab.tgt
overwrite: True

data:
    corpus_1:
        path_src: …/corpus/toy-en-de/src-train.txt
        path_tgt: …/corpus/toy-en-de/tgt-train.txt
        transforms: [filtertoolong]
    valid:
        path_src: …/corpus/toy-en-de/src-val.txt
        path_tgt: …/corpus/toy-en-de/tgt-val.txt
        transforms: [filtertoolong]

world_size: 1
gpu_ranks: [0]

save_model: toy-en-de/run/model
save_checkpoint_steps: 500
train_steps: 30000
valid_steps: 1000
batch_type: tokens

Can this config produce a good result for this toy example, with 30,000 train steps and batch_type: tokens? I ask because in the previous run almost every output in pred_1000.txt looked like this:

[2021-03-25 01:05:15,279 INFO]
SENT 2729: ['But', 'transmissions', 'are', 'stronger', 'when', 'devices', 'are', 'downloading', 'or', 'sending', 'data', '.']
PRED 2729: Eine sind wir wir wir wir wir die , die , die , die , dass wir die , die , die , dass wir die , die , dass wir die , die , dass wir die , die , dass wir dass wir die , die , dass wir dass wir die , die , dass wir dass wir die , die , dass wir dass wir die , die , dass wir dass wir die , die , dass wir dass wir die , die , dass wir dass wir die , die , dass wir die
PRED SCORE: -270.4714

francoishernandez · March 25, 2021, 9:11am

You need to also change the value of batch_size, because the default value of 64 is in “sents”, not “tokens”. You can set 2048 for instance to get started.
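
In the YAML config, that would be a fragment along these lines. It is only a sketch: 2048 is the starting value mentioned above, and lowering it if memory still runs out is my assumption about how you would tune it on a 2GB GPU. Left at 64 with batch_type: tokens, each batch would hold at most about 64 tokens, which is only a few short sentences.

```yaml
batch_type: tokens   # batch_size is now counted in tokens rather than sentences
batch_size: 2048     # starting point only; assumed to be lowered if CUDA memory still runs out
```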

Training on this dataset won’t yield any good result. It’s a toy dataset to check everything runs properly, not to build a proper translation model.
A complete example for a more realistic translation task is available here: Translation — OpenNMT-py documentation

chopinml · March 25, 2021, 4:13pm

Thank you François, I will set it 2048 and play it a little bit. Not expecting good translations but want to see some progress then it is O.K.

The real world translation example seems to have Linux required with shell scripts. I will do my best.

Thank you so much for your help and fast responses.

francoishernandez · March 25, 2021, 4:28pm

Thank you François, I will set it 2048 and play it a little bit. Not expecting good translations but want to see some progress then it is O.K.

You won’t see much progress, since the dataset is not suited to the task.

The real world translation example seems to have Linux required with shell scripts. I will do my best.