CUDA out of memory error

kameshkotwani · May 9, 2022, 5:02am

I have been working on training a multilingual model using a huge corpus, but even after using 8 GPUs I am still getting an out-of-memory error.

Here is my config file:

save_data: run/

src_vocab_size: 500000
tgt_vocab_size: 500000

# to prevent out of memory error 
src_seq_length: 70
tgt_seq_length: 70

src_vocab: run/vocaben.src
tgt_vocab: run/vocabde.tgt


overwrite: True

# corpus post
data:
    corpus_1:
        path_src: data/description/pattr.de-en.description.en
        path_tgt: data/description/pattr.de-en.description.de
    corpus_2:
        path_src: data/abstract/pattr.de-en.abstract.en
        path_tgt: data/abstract/pattr.de-en.abstract.de
    corpus_3:
        path_src: data/claims/pattr.de-en.claims.en
        path_tgt: data/claims/pattr.de-en.claims.de
    valid:
        path_src: data/valid/engvalid.txt
        path_tgt: data/valid/germanvalid.txt

# to train the model
world_size: 8
gpu_ranks: [0,1,2,3,4,5,6,7]


# Batching
queue_size: 10000
bucket_size: 32768
batch_type: "tokens"
batch_size: 128
valid_batch_size: 128
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]


# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"


# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 3
dec_layers: 3
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]

average_decay: 0.0005
seed: 1234


save_model: models8gpu/
save_checkpoint_steps: 2000 
train_steps: 100000
valid_steps: 5000

Also, here is a screenshot of the out-of-memory error:

I need some help to fix it.

vince62s · May 9, 2022, 6:21am

You must be working on a shared system: someone else is using GPU 5 and has already allocated a large amount of its memory.
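One way to check whether another process already holds memory on a GPU is to query `nvidia-smi` before launching training. The following is a minimal sketch, not part of the original thread: the helper names (`parse_gpu_memory`, `busy_gpus`, `print_report`) and the 500 MiB threshold are hypothetical, and it assumes every queried field comes back as a plain integer.

```python
import subprocess

# nvidia-smi query that prints one CSV line per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn the CSV text from the query above into (index, used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def busy_gpus(rows, max_used_mib=500):
    """Return indices of GPUs that already use more than max_used_mib (hypothetical threshold)."""
    return [index for index, used, _total in rows if used > max_used_mib]


def print_report():
    """Run nvidia-smi on this machine and print per-GPU memory usage."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    rows = parse_gpu_memory(out)
    for index, used, total in rows:
        print(f"GPU {index}: {used} / {total} MiB used")
    print("GPUs already in use:", busy_gpus(rows))
```

Calling `print_report()` on the training machine before starting a run shows which GPU indices are safe to list in `gpu_ranks`.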

kameshkotwani · May 9, 2022, 6:25am

I think so. I will ask for exclusive access and try again. Thanks for the response!

vince62s · May 9, 2022, 6:30am

Also, you want more tokens in your batch, something like 4096 or 8192, or switch to sentence mode. Validation uses sentence mode, so you may try a smaller value there, like 16.
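To see why 128 tokens is small, here is a rough arithmetic sketch using the values from the posted config. It assumes that, in token mode, the number of tokens per optimizer update is `batch_size * accum_count * world_size`; the function names are made up for illustration.

```python
def tokens_per_update(batch_size, accum_count, world_size):
    """Tokens contributing to one optimizer step (assumed: per-GPU batch x accumulation x GPUs)."""
    return batch_size * accum_count * world_size


def full_length_sentences(batch_size, seq_length):
    """How many sentences of maximum length fit into one per-GPU token batch."""
    return batch_size // seq_length


# Values from the config: accum_count 3, world_size 8, src_seq_length 70.
for batch_size in (128, 4096, 8192):
    print(
        batch_size,
        tokens_per_update(batch_size, accum_count=3, world_size=8),
        full_length_sentences(batch_size, seq_length=70),
    )
```

With `batch_size: 128`, a single 70-token sentence nearly fills a per-GPU batch, so each GPU sees only one or two sentences at a time.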

guillaumekln · May 9, 2022, 10:57am

The vocabulary size is also too big. You should probably divide this value by 10.
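A back-of-the-envelope estimate shows how much the vocabulary alone can cost. This is only a sketch under stated assumptions: fp32 values, no weight sharing between the embeddings and the output layer, an output layer with a bias, and Adam keeping two moment buffers per parameter. Activations and the transformer layers themselves are ignored, and all function names are hypothetical.

```python
def vocab_param_count(src_vocab, tgt_vocab, dim):
    """Parameters tied to vocabulary size: two embedding tables plus the output layer."""
    src_embedding = src_vocab * dim
    tgt_embedding = tgt_vocab * dim
    generator = dim * tgt_vocab + tgt_vocab  # weight matrix plus bias (assumed)
    return src_embedding + tgt_embedding + generator


def training_bytes(params, bytes_per_value=4, copies=4):
    """Memory for weights, gradients and two Adam moment buffers (4 copies, fp32)."""
    return params * bytes_per_value * copies


def logits_bytes(tokens, tgt_vocab, bytes_per_value=4):
    """Memory for one tensor of output scores: one value per token per vocabulary entry."""
    return tokens * tgt_vocab * bytes_per_value


for vocab in (500_000, 50_000):
    params = vocab_param_count(vocab, vocab, dim=512)
    print(
        f"vocab={vocab}: "
        f"{training_bytes(params) / 1e9:.1f} GB for vocabulary parameters, "
        f"{logits_bytes(4096, vocab) / 1e9:.1f} GB of logits per 4096-token batch"
    )
```

Under these assumptions a 500K vocabulary needs roughly 12.3 GB per GPU for the vocabulary-related parameters and optimizer state alone, versus about 1.2 GB at 50K, and the logits for a 4096-token batch shrink from about 8.2 GB to about 0.8 GB.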

kameshkotwani · May 11, 2022, 5:26am

The problem is that if we only choose a 50K vocab, then mostly prepositions are coming up. That is why accuracy comes out very low (35-40%). I think keeping a big vocab would increase the accuracy. Please let me know if I am wrong, and what other optimizations you would suggest. I have done everything I could on my end, but I am still not able to make it work.

vince62s · May 12, 2022, 9:00am

Read more on this forum about BPE or SentencePiece.
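For context, subword methods such as BPE keep the vocabulary small by splitting rare words into frequent pieces, so no word is out of vocabulary. The toy sketch below illustrates the merge idea only; it is not the SentencePiece or subword-nmt implementation, and the function names and the `</w>` end-of-word marker are choices made for this illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merges by repeatedly joining the most frequent adjacent pair."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Break frequency ties on the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def segment(word, merges):
    """Split a word (seen or unseen) into subwords by applying the merges in learned order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))  # an unseen word is still covered by known pieces
```

The number of merges controls the vocabulary size directly, which is why a subword vocabulary of a few tens of thousands of entries can cover a corpus that would otherwise need hundreds of thousands of word types.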