Training doesn't start on GPU but runs fine on CPU; please review my command

Ravneet · June 22, 2019, 6:45pm

I am using a Titan V for training, and I think there is a problem with the command I am using.

If I train on the CPU with the command below:
“python train.py -src_word_vec_size 200 -tgt_word_vec_size 200 -data data/model -save_model sum_eng-model -batch_size 64 -valid_steps 5000 -train_steps 100000 -report_every 50”
Training starts on the CPU; it is slow but works fine.

But when I try the GPU with the command below:
“python train.py -src_word_vec_size 200 -tgt_word_vec_size 200 -data data/model -save_model sum_eng-model -save_checkpoint_steps 100 -world_size 2 -gpu_ranks 1 -batch_size 32 -valid_steps 1000 -train_steps 100000 -report_every 1”

Here I even reduced the batch size, the checkpoint interval, and the reporting interval, but nothing happens; not even the model description is printed. (I have a GTX 1070 as GPU 0, so I am using -gpu_ranks 1 for the Titan.)
Am I using the right command?
Many thanks in advance.

ymoslem · June 22, 2019, 6:51pm

Dear Ravneet,

No. If you have one GPU, then it is -gpu_ranks 0.

You might also need to start your command with CUDA_VISIBLE_DEVICES=0

Kind regards,
Yasmin

Ravneet · June 22, 2019, 6:57pm

Hi Yasmin, thanks for your reply. Actually, I have 2 GPUs and want to train on the 2nd one, so I wrote -gpu_ranks 1.
Do you mean I should add CUDA_VISIBLE_DEVICES=1, like this:

“python train.py CUDA_VISIBLE_DEVICES=1 -src_word_vec_size 200 -tgt_word_vec_size 200 -data data/model -save_model sum_eng-model -save_checkpoint_steps 100 -world_size 2 -gpu_ranks 1 -batch_size 32 -valid_steps 1000 -train_steps 100000 -report_every 1”

Is the command above correct?

ymoslem · June 22, 2019, 7:08pm

Dear Ravneet,

If you want to use the 2nd one, then use CUDA_VISIBLE_DEVICES=1, and add it at the very beginning of the command, before python train.py.

Kind regards,
Yasmin
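
The placement matters because CUDA_VISIBLE_DEVICES is an environment variable, not a train.py option: a shell only treats NAME=value as an environment assignment when it comes before the command name, and anything written after python train.py is passed to the script as an ordinary argument. A minimal Python sketch of the same idea follows. The helper name run_with_visible_gpu is hypothetical, and the child process here is a stand-in that only prints what it sees rather than the real training script.

```python
import os
import subprocess
import sys


def run_with_visible_gpu(argv, physical_gpu):
    """Run argv in a child process with CUDA_VISIBLE_DEVICES set (hypothetical helper)."""
    # Copy the current environment and add the variable; this mirrors what a
    # shell does for `CUDA_VISIBLE_DEVICES=1 python train.py ...`.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(physical_gpu))
    return subprocess.run(argv, env=env, capture_output=True, text=True)


# Stand-in child that reports the variable it inherited. For real training,
# pass the actual ["python", "train.py", ...] argument list instead.
probe = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))",
]
result = run_with_visible_gpu(probe, physical_gpu=1)
print(result.stdout.strip())
```

The child inherits the variable through its environment, which is where the CUDA runtime looks for it.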

Ravneet · June 22, 2019, 7:53pm

Hey Yasmin,
I tried this command
“CUDA_VISIBLE_DEVICES=1 python train.py -src_word_vec_size 200 -tgt_word_vec_size 200 -data data/model -save_model sum_eng-model -save_checkpoint_steps 100 -world_size 2 -gpu_ranks 1 -batch_size 32 -valid_steps 1000 -train_steps 10000 -report_every 1”

but the result is the same: nothing happened, and not even the model description is printed. I waited for 30 minutes.
I checked with the “nvidia-smi” command; it shows both GPUs, and CUDA 9 is also available.
Is there any other solution?

ymoslem · June 22, 2019, 9:27pm

Dear Ravneet,

I noticed something else in your command: -world_size 2 is not correct as long as you use only one GPU. Change it to -world_size 1.

Kind regards,
Yasmin

Ravneet · June 22, 2019, 9:47pm

Hey Yasmin, the command works after this change. Thanks.
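
To summarize the fix from this thread, here is a small sketch that assembles a single-GPU command line for a single machine. The helper name build_single_gpu_command is hypothetical. It uses -gpu_ranks 0, which is an assumption based on the earlier advice that one GPU means rank 0: once CUDA_VISIBLE_DEVICES hides the other card, the remaining device is numbered from 0 inside the process. In the thread itself, only -world_size was reported as changed.

```python
import shlex


def build_single_gpu_command(physical_gpu, train_args):
    """Return a shell command that trains on one physical GPU (hypothetical helper)."""
    # The environment assignment must come before the command name.
    env_prefix = f"CUDA_VISIBLE_DEVICES={physical_gpu}"
    # One visible GPU: one process (-world_size 1), assumed to be rank 0.
    argv = ["python", "train.py", *train_args, "-world_size", "1", "-gpu_ranks", "0"]
    return env_prefix + " " + " ".join(shlex.quote(arg) for arg in argv)


# Options copied from the command discussed in this thread.
train_args = [
    "-src_word_vec_size", "200", "-tgt_word_vec_size", "200",
    "-data", "data/model", "-save_model", "sum_eng-model",
    "-save_checkpoint_steps", "100", "-batch_size", "32",
    "-valid_steps", "1000", "-train_steps", "10000", "-report_every", "1",
]
command = build_single_gpu_command(1, train_args)
print(command)
```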

ngdave · September 26, 2019, 7:54am

Hi Yasmin,
Is it possible to train a model with an Intel GPU? If so, what changes should one make to the config file for the REST server API?
Kind regards.

ymoslem · September 26, 2019, 9:09am

Hi David!

As far as I know, a GPU must support CUDA; otherwise, you can use a CPU.

Please check this link: "What type of computer do I need to train with?"

All the best,
Yasmin
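
A quick way to see whether any CUDA-capable GPU is usable is to ask the framework directly. This is a minimal sketch that assumes the training script is built on PyTorch (as OpenNMT-py is). The function name visible_cuda_devices is hypothetical, and it returns an empty list if PyTorch is not installed or no CUDA device is found.

```python
def visible_cuda_devices():
    """List the names of the CUDA devices PyTorch can see (hypothetical helper)."""
    try:
        import torch  # assumption: the training script uses PyTorch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    # Devices are numbered from 0 within the process, after any
    # CUDA_VISIBLE_DEVICES filtering has been applied.
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


devices = visible_cuda_devices()
if devices:
    for index, name in enumerate(devices):
        print(index, name)
else:
    print("No CUDA device visible to PyTorch")
```

If the list is empty, the GPU options will not help, and training without them falls back to the CPU as in the first post.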

ngdave · September 27, 2019, 8:39am

Thanks for the link.