Data Size:
25 million segments, with 881893 unique source tokens and 1096054 unique target tokens, and a mix of short and long sentences (some with more than 1000 tokens).
Machine Specifications:
AWS p2.8xlarge with 8 GPUs of 12 GB memory each (total 96 GB)
Issue:
Training cannot start; I get the error “RuntimeError: CUDA out of memory.”
Preprocessing options I used (successfully):
python3 preprocess.py -train_src source.txt -train_tgt target.txt -valid_src validsource.txt -valid_tgt validtarget.txt -save_data fren -src_vocab_size 881893 -tgt_vocab_size 1096054 -src_seq_length 1500 -tgt_seq_length 1500 -dynamic_dict -share_vocab -log_file "log.txt"
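For reference, the vocabulary sizes and maximum segment lengths quoted above can be double-checked with a quick count over the raw files. This is only a minimal sketch: it assumes the files are already tokenized so that splitting on whitespace matches the tokens preprocess.py sees, and it uses the file names from the command above.

```python
# Quick sanity check of the vocabulary sizes and segment lengths.
# Assumes whitespace-tokenized input files, as expected by preprocess.py.
from collections import Counter

def corpus_stats(path):
    vocab = Counter()
    max_len = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            vocab.update(tokens)
            max_len = max(max_len, len(tokens))
    return len(vocab), max_len

src_vocab, src_max = corpus_stats("source.txt")
tgt_vocab, tgt_max = corpus_stats("target.txt")
print(f"source: {src_vocab} unique tokens, longest segment {src_max} tokens")
print(f"target: {tgt_vocab} unique tokens, longest segment {tgt_max} tokens")
```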
Training options (causing the error):
These are the recommended Transformer options (I tried them with both 4 GPUs and 8 GPUs):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python3 train.py -data fren -save_model fren-model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -keep_checkpoint 5 -log_file log.train -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7
I also tried halving the batch_size on 8 GPUs, but I still get the “CUDA out of memory” error.
I understand that my preprocessing values for -src_vocab_size, -tgt_vocab_size, -src_seq_length, and -tgt_seq_length are very large, but I chose them to:
- avoid out-of-vocabulary tokens
- handle very long sentences
because when I preprocessed with lower values, the translations had problems on both counts.
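To put these vocabulary sizes in perspective, here is a rough back-of-the-envelope estimate of the memory needed just for the embedding tables and the output projection, assuming fp32 parameters and the 512-dimensional model from the training command, and ignoring the effect of -share_vocab and everything else in the network:

```python
# Rough memory estimate for the embedding tables and the output projection
# alone, in fp32 (4 bytes per parameter). Illustrative only: it ignores
# -share_vocab, the rest of the Transformer, and all activations.
d_model = 512
src_vocab = 881_893
tgt_vocab = 1_096_054

src_emb = src_vocab * d_model      # source embedding table
tgt_emb = tgt_vocab * d_model      # target embedding table
generator = tgt_vocab * d_model    # projection from d_model onto the target vocab

params = src_emb + tgt_emb + generator
gib = lambda n: n * 4 / 1024**3
print(f"embeddings + generator:        {gib(params):.1f} GiB")
# Adam keeps two extra buffers per parameter (exp_avg, exp_avg_sq),
# so optimizer state roughly triples the footprint on each GPU.
print(f"with Adam moment buffers (x3): {gib(3 * params):.1f} GiB")
```

By this rough estimate, these layers plus their Adam buffers already exceed the 12 GB of a single p2.8xlarge GPU, before counting the rest of the model, the activations, or the softmax over the full target vocabulary.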
So I would highly appreciate any recommendations for handling this situation.
Many thanks,
Yasmin