GPU is not used

Hey there guillaumekln, I'm trying to run a model on a GPU, and I saw from this post that to do so I have to pass the -gpuid option with any ID that's not 0. I don't get any error when I run the model with -gpuid set to 1 or 2, but I noticed that my GPU isn't being used when I verify with the nvidia-smi command. I'm also using the PyTorch version. Any pointers?

Thanks

Hi,

What is the command line you used?

This is the command line I used:

python train.py -data cmn-eng/demo -save_model bidir_2layer_500u -src_word_vec_size 500 -tgt_word_vec_size 500 -encoder_type brnn -decoder_type rnn -rnn_size 500 -enc_layers 2 -dec_layers 2 -rnn_type LSTM -global_attention general -batch_size 64 -optim adam -adam_beta1 .9 -adam_beta2 .999 -dropout .4 -learning_rate .001 -report_every 200 -train_steps 20000 -gpuid 1

I did see in the documentation that -gpuid is deprecated and that I should use -world_size and -gpu_ranks instead. This is my first time using a GPU to train a model, so I'm not entirely sure what these two options do.

guillaumekln,

I was able to get it to work. Nonetheless, thank you for the fast response.

Best,
Dillon

Can you share what you found and how you fixed it for future users? Thanks.

Yeah, sure. The big thing I needed to figure out was what world_size and gpu_ranks meant. I rented an AWS p2.xlarge instance, which provides 1 GPU, and wanted to train an NMT model on a dataset of about 10 million examples. According to this post here, to use my GPU I first needed to select which GPU to use by exporting CUDA_VISIBLE_DEVICES. Once that is done, when writing the command line to train the model, -world_size corresponds to the number of GPUs to use and -gpu_ranks refers to the order in which you want them used. Since I had 1 GPU, I set -world_size 1 -gpu_ranks 0, because CUDA_VISIBLE_DEVICES enumerates your GPUs starting at 0. More can be found at this link here:

https://training.acceleware.com/blog/cudavisibledevices-masking-gpus
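In full, the sequence looked roughly like this (a sketch of my setup; everything besides -world_size and -gpu_ranks is just carried over from my earlier command, so adjust the model flags for your own training run):

# Make only the first physical GPU visible; it gets enumerated as device 0.
export CUDA_VISIBLE_DEVICES=0

# -world_size is the total number of GPUs; -gpu_ranks lists which device(s) this run uses (0-indexed).
python train.py -data cmn-eng/demo -save_model bidir_2layer_500u -world_size 1 -gpu_ranks 0

With more than one GPU you would raise -world_size and list the extra ranks, e.g. -world_size 2 -gpu_ranks 0 1.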
