ymoslem
(Yasmin Moslem)
May 9, 2019, 8:46pm
1
Hello!
I want to run the transformer model with the parameters mentioned at:
http://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model-do-you-support-multi-gpu
However, the machine I am currently using has only 2 GPUs. Should I adjust any of the recommended values to get the expected results?
Many thanks,
Yasmin
yaren
May 10, 2019, 10:46pm
2
I have two GPUs, so I add these parameters:
-world_size 2 -gpu_ranks 0 1
That keeps both GPUs working.
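For example, a minimal sketch of where those flags go in a train.py call (the data and model paths here are just placeholders):
```
python train.py -data data/demo -save_model models/demo \
    -world_size 2 -gpu_ranks 0 1
```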
I also use watch -n 1 -d nvidia-smi
to monitor their performance.
Just do it; experience is the best teacher.
ymoslem
(Yasmin Moslem)
May 11, 2019, 3:30am
3
Thanks, Yaren, for your reply.
Yes, I am sure about -world_size 2 -gpu_ranks 0 1.
I just wondered if I should change other parameters, like -batch_size.
yaren:
watch -n 1 -d nvidia-smi
Thanks indeed for the tip!
Kind regards,
Yasmin
yaren
May 11, 2019, 2:41pm
4
Yes:
-batch_size should be changed to fit your GPU RAM;
-train_steps depends on the number of sentence pairs in your corpus.
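For example, a rough back-of-the-envelope estimate for -train_steps, assuming token-based batching (the corpus size below is made up), might look like this:
```
# Roughly how many optimizer steps make one pass (epoch) over the corpus.
tokens_in_corpus = 50_000_000   # hypothetical: total source tokens in the training data
batch_size = 4096               # -batch_size (tokens per GPU, with -batch_type tokens)
num_gpus = 2                    # -world_size
accum_count = 2                 # -accum_count (gradient accumulation)

effective_batch = batch_size * num_gpus * accum_count   # tokens per optimizer step
steps_per_epoch = tokens_in_corpus // effective_batch
print(steps_per_epoch)          # ~3051 steps per epoch; scale -train_steps accordingly
```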
Otherwise, you can use the command as given in the FAQ.
Copied here for you:
```
python train.py -data /tmp/de2/data -save_model /tmp/extra \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 4 -gpu_ranks 0 1 2 3
```
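If you only have 2 GPUs, one way to keep roughly the same effective batch size is to double -accum_count; a sketch of just the flags that would change from the command above:
```
# 2 GPUs instead of 4; accum_count doubled so batch_size * num_gpus * accum_count stays the same
-accum_count 4 -world_size 2 -gpu_ranks 0 1
```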
ymoslem
(Yasmin Moslem)
May 11, 2019, 6:28pm
5
Thanks, Yaren, for your insights!
I am currently training a model with the recommended Transformer parameters and will compare the results to the model I already trained on the same corpus with the default parameters.
marwagaser
(Marwa Gaser)
October 20, 2021, 12:24pm
6
@ymoslem I have the same question: did you change any of the hyperparameter values when training on fewer GPUs?
francoishernandez
7
You can also “replicate” 4 GPUs with 2 GPUs by playing with the accum-related parameters. Search for accum_count, gradient accumulation, etc.
1 Like
marwagaser
(Marwa Gaser)
October 20, 2021, 2:54pm
8
@francoishernandez can I divide the batch_size by 4 if I want to use 1 GPU, instead of changing the accum_count? Does that make sense?
francoishernandez
9
No. The batch_size is always considered “per GPU”.
When training on multiple GPUs, the “real” batch size is actually batch_size * num_gpus.
Gradient accumulation allows you to “simulate” bigger batches, hence the real batch size is batch_size * num_gpus * accum_count.
So, switching from 4 GPUs to 1, you can simply multiply accum_count by 4.
If you divide your batch_size by 4 in addition to switching from 4 GPUs to 1, your “real batch size” will be 16x smaller, and your GPU will probably be underutilized.
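To make that arithmetic concrete, a quick sketch with the numbers from the command earlier in this thread (batch_size 4096 tokens, accum_count 2):
```
# "Real" batch size in tokens per optimizer step: batch_size * num_gpus * accum_count
batch_size, accum_count = 4096, 2

four_gpus = batch_size * 4 * accum_count        # 4096 * 4 * 2 = 32768 tokens
one_gpu = batch_size * 1 * (accum_count * 4)    # same 32768 tokens, with accum_count multiplied by 4
print(four_gpus, one_gpu)
```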
2 Likes