"CUDA out of memory" during training with high preprocess values

ymoslem · July 13, 2019, 3:58am

Data Size

25 million segments with 881893 unique source tokens and 1096054 unique target tokens, including both short and long sentences (some with more than 1000 tokens).

Machine Specifications:

AWS p2.8xlarge with 8 GPUs of 12 GB memory each (total 96 GB)

Issue:

Training cannot start. I get an error “RuntimeError: CUDA out of memory.”

Preprocess options I used (successfully):

python3 preprocess.py -train_src source.txt -train_tgt target.txt -valid_src validsource.txt -valid_tgt validtarget.txt -save_data fren -src_vocab_size 881893 -tgt_vocab_size 1096054 -src_seq_length 1500 -tgt_seq_length 1500 -dynamic_dict -share_vocab -log_file "log.txt"

Training options (causing the error):

Recommended Transformer options (I tried both with 4 GPUs and 8 GPUs).

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train.py -data fren -save_model fren-model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -keep_checkpoint 5 -log_file log.train -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7

I also tried halving the batch_size on 8 GPUs, but still got the “CUDA out of memory” error.

I understand that my preprocessing values for -src_vocab_size, -tgt_vocab_size, -src_seq_length, and -tgt_seq_length are very large, but I wanted to:

avoid out-of-vocabulary
deal with very long sentences

because when I preprocessed with lower values, the translation had issues with both points.

So I will highly appreciate any recommendations to handle this situation.

Many thanks,
Yasmin

vince62s · July 13, 2019, 2:02pm

You can’t deal with such vocab sizes. You’ll need to make choices.
Either use BPE/subword tools,
or cut the vocab to a reasonable range (30k-50k).
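
To see why vocabularies of this size are out of reach, here is a rough back-of-the-envelope estimate in Python. It is a sketch under stated assumptions, not a measurement: separate source and target embeddings plus an output projection, each of size vocab x 512 (the -word_vec_size and -rnn_size used above), float32 weights, and Adam keeping a gradient and two moment buffers per weight. With -share_vocab the real sizes will differ, and activations and the rest of the Transformer are ignored. The function names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4
# Assumption: under Adam each weight needs 4 copies in memory:
# the weight itself, its gradient, and the two moment estimates.
COPIES_WITH_ADAM = 4


def vocab_param_count(src_vocab, tgt_vocab, dim):
    """Count only the parameters whose number grows with vocabulary size."""
    source_embedding = src_vocab * dim
    target_embedding = tgt_vocab * dim
    generator = tgt_vocab * dim  # output projection, bias ignored
    return source_embedding + target_embedding + generator


def training_gigabytes(n_params):
    """Approximate training memory (decimal GB) for n_params float32 weights."""
    return n_params * BYTES_PER_FLOAT32 * COPIES_WITH_ADAM / 1e9


huge = vocab_param_count(881893, 1096054, 512)
small = vocab_param_count(50000, 50000, 512)
print(f"full vocab: {huge:,} params, ~{training_gigabytes(huge):.1f} GB")
print(f"50k vocab:  {small:,} params, ~{training_gigabytes(small):.1f} GB")
```

Under these assumptions the vocabulary-dependent weights alone need roughly 25 GB, more than a single 12 GB GPU can hold, while a 50k vocabulary needs about 1.2 GB. Multi-GPU training here is data-parallel, so each GPU keeps a full copy of the model and the 96 GB total does not help.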

ymoslem · July 14, 2019, 7:02am

Many thanks, Vincent!

It seems so. I tried diverse options; even the default options of OpenNMT-py (i.e. without the Transformer model/options) do not work.

I understand using BPE/subwords is very useful for languages with words that are actually combinations of multiple words. In your experience, for language pairs like French-English (and vice versa), does using BPE/subwords make a considerable difference?

If you do not mind, one more question: in my preprocessing command, I also used -dynamic_dict and -share_vocab because I read that they help with copying source tokens for unknown words, perhaps together with -replace_unk. Is this true? (I know that -replace_unk does not work with the Transformer model, but I was intending to run another training with the default RNN encoder/decoder as well.)

Finally, I see the Transformer recommended options use:
-batch_size 4096 -batch_type tokens -normalization tokens
Is using tokens instead of sentences recommended even without the Transformer model?

Many thanks indeed,
Yasmin

vince62s · July 14, 2019, 8:48pm

Most research papers now use BPE, even for French/English. -share_vocab is also widely used.

Token batch mode should work fine; it is not tied to a specific model.
However, bear in mind that the batch size may have an impact on how the model learns.
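
As an illustration of what batching by tokens means, here is a minimal Python sketch. It is not OpenNMT-py's actual batching code: the function name batch_by_tokens is hypothetical, and the cost model (longest sentence in the batch times the number of sentences, i.e. the padded size) is an assumption made for the example.

```python
def batch_by_tokens(sentences, max_tokens):
    """Greedily group tokenized sentences so that the padded size of each
    batch (longest sentence * number of sentences) stays within max_tokens.
    A sentence longer than max_tokens ends up alone in its own batch."""
    batches, current, longest = [], [], 0
    for sent in sentences:
        new_longest = max(longest, len(sent))
        if current and new_longest * (len(current) + 1) > max_tokens:
            # Adding this sentence would exceed the budget: close the batch.
            batches.append(current)
            current, new_longest = [], len(sent)
        current.append(sent)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


# Toy corpus: sentences of 2, 3, 4, 10 and 1 tokens, budget of 10 tokens.
corpus = [["w"] * n for n in (2, 3, 4, 10, 1)]
batches = batch_by_tokens(corpus, max_tokens=10)
print([[len(s) for s in b] for b in batches])
```

With a token budget, the number of sentences per batch varies with their length, so one very long sentence can take up most of a batch by itself, whereas a sentence budget would let memory use grow with sentence length.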

ymoslem · July 15, 2019, 3:46am

Dear Vincent,

Many thanks for the valuable information!

Thanks for highlighting this!

So my next tests will include:

Using BPE/subwords
Incremental training on new data

In the past, I tried incremental training / retraining but did not notice a real difference. Now I will try with 1) more data, and 2) maybe the TensorFlow version, to be able to update the vocabulary. I also believe that if the retraining is for the purpose of domain adaptation, I might use more steps when training with the specialized data.

Many thanks again!

Kind regards,
Yasmin

vince62s · July 16, 2019, 11:25am

When using BPE/subwords, there is very little gain in updating the vocab, unless you want to make sure a specific word that was not in the initial vocab is taken into account. But chances are that it will be “subworded” anyway.
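
To make the “subworded” point concrete, here is a small Python sketch. It uses greedy longest-match segmentation against a hypothetical subword inventory, which is a simplification and not the BPE merge procedure itself; the function name segment and the toy inventory are assumptions for illustration only.

```python
def segment(word, subwords):
    """Split a word into the longest known subwords, left to right.
    Characters not covered by any subword are emitted on their own."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in subwords:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known subword starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical inventory learned from the old data.
inventory = {"trans", "form", "er", "s", "un", "train", "ed"}

# "transformers" is not in the inventory as a whole word,
# but it can still be represented with known pieces.
print(segment("transformers", inventory))
print(segment("untrained", inventory))
```

A word that never appeared in the original vocabulary is still covered by pieces the model already knows, which is why updating the vocabulary usually brings little.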

For retraining, it’s all in the balance between old data and new data. If you take only new data, you may break the model.

ymoslem · July 17, 2019, 9:05am

Many thanks, Vincent, for your valuable insights!

In OpenNMT-py, I use -train_from when the training is interrupted for some reason. When retraining, I preprocess the new dataset and start the training with the newly created files from preprocessing the new dataset. Is this the correct way?

As there is no option to update the vocab, and you are saying it might not be necessary with BPE/subwords, my question is: what quality gain can retraining bring in OpenNMT-py?

Thanks for the note! So should I add full sentences from the old dataset to the new dataset, or is it enough if both datasets are from the same domain?

Many thanks,
Yasmin

vince62s · July 17, 2019, 10:39am

When preprocessing new data for an existing model, you need to pass the old vocab file to the preprocess command; otherwise it will generate a new vocab that does not match the model.
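
The mismatch can be illustrated with a toy Python sketch. It assumes a vocabulary is simply a token-to-index mapping ordered by frequency; build_vocab is a hypothetical helper, not the OpenNMT-py implementation.

```python
from collections import Counter


def build_vocab(sentences):
    """Map each token to an index, most frequent first (ties alphabetical)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered)}


old_vocab = build_vocab(["the cat sat", "the dog sat"])
new_vocab = build_vocab(["a dog ran", "a dog barked"])

# The same token gets a different index in each vocabulary.
print("dog in old vocab:", old_vocab["dog"])
print("dog in new vocab:", new_vocab["dog"])
```

A trained model's embedding rows are tied to the indices of the vocabulary it was trained with, so data encoded with a freshly built vocabulary would point at the wrong rows.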

There is no magic recipe. Depending on the case, you simply add the new data to the old data, or add the new data to some of the old data at a given ratio, e.g. 1:1 or something else.
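
One way to build such a mix is sketched below in Python. The function mix_corpora and its parameters are hypothetical, and random sampling of the old data is only one possible choice; the right ratio has to be found by experiment.

```python
import random


def mix_corpora(old_pairs, new_pairs, old_per_new=1.0, seed=0):
    """Return all new sentence pairs plus a random sample of old pairs.
    old_per_new=1.0 gives a 1:1 mix; the sample is capped at len(old_pairs)."""
    rng = random.Random(seed)
    n_old = min(len(old_pairs), int(len(new_pairs) * old_per_new))
    mixed = rng.sample(old_pairs, n_old) + list(new_pairs)
    rng.shuffle(mixed)
    return mixed


# Toy data: (source, target) pairs standing in for real corpora.
old_data = [(f"old src {i}", f"old tgt {i}") for i in range(100)]
new_data = [(f"new src {i}", f"new tgt {i}") for i in range(10)]

mixed = mix_corpora(old_data, new_data, old_per_new=1.0)
print(len(mixed), "pairs in the retraining set")
```

The source and target sides of the mixed pairs can then be written to files and preprocessed as usual.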

Good luck.

ymoslem · July 17, 2019, 11:56am

Thanks, Vincent!

Excuse me, to which argument?

Yes, deep neural luck.

Many thanks,
Yasmin

vince62s · July 17, 2019, 12:52pm

Here: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/opts.py#L235-L240
But the documentation is wrong: if you pass a pt file, it will be OK.

ymoslem · July 17, 2019, 12:59pm

Perfect! Many thanks, Vincent!