OpenNMT Forum

CUDA_VISIBLE_DEVICES doesn't select the correct GPU

Hi
I am running opennmt-tf from a python notebook.
I have 4 GPUs, therefore I want to run 4 training sessions in parallel, each using one GPU.
I set this:

os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="1"

and even so, it loads GPU ID 0 and then everything crashes.
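One thing worth checking (this is a general TensorFlow/CUDA property, not something stated in this thread): the environment variables only take effect if they are set before TensorFlow initializes CUDA, which typically happens at import time. In a notebook that has already imported tensorflow or opennmt-tf, setting them later has no effect. A minimal sketch of the required ordering:

```python
import os

# Select which physical GPU this notebook process may use.
# These lines must run BEFORE importing tensorflow / opennmt-tf:
# once TensorFlow has initialized CUDA, changing them has no effect.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # order GPUs by PCI bus ID
os.environ["CUDA_VISIBLE_DEVICES"] = "1"         # expose only physical GPU 1

# Only now import the framework, e.g.:
# import tensorflow as tf
# tf.config.list_physical_devices("GPU")  # would list a single GPU:0
```

If the notebook kernel already imported TensorFlow earlier, restart the kernel before setting the variables.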

2021-02-04 15:59:23.953653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

any idea how to fix this?

Hi,

It seems this is the expected behavior. When setting CUDA_VISIBLE_DEVICES=1, the process will see a single GPU with ID 0, but it is actually GPU 1 on the system.
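To illustrate the renumbering, here is a plain-Python simulation (not the real CUDA driver, just the mapping rule): whatever physical IDs you list in CUDA_VISIBLE_DEVICES are re-indexed from 0 inside the process, in the order listed.

```python
def visible_gpus(cuda_visible_devices: str):
    """Simulate how CUDA renumbers devices inside a process.

    Returns (local_id, physical_id) pairs: the process sees devices
    numbered from 0, in the order they appear in the variable.
    """
    physical = [int(x) for x in cuda_visible_devices.split(",") if x != ""]
    return list(enumerate(physical))

# With CUDA_VISIBLE_DEVICES="1", the process sees one GPU with local
# ID 0, which is physical GPU 1 on the system:
print(visible_gpus("1"))    # [(0, 1)]
print(visible_gpus("2,3"))  # [(0, 2), (1, 3)]
```

So the log line "Adding visible gpu devices: 0" does not mean GPU 1 was ignored; it is GPU 1 seen under its in-process ID 0.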

Hi, I do not understand. So how do I set it to work with GPU ID 2, for example?

Set CUDA_VISIBLE_DEVICES=2

Hi, I did that, but when running the code twice in parallel, with CUDA_VISIBLE_DEVICES set to a different ID in each run, both threads crashed. I understood from the error message that they were getting mixed up. Is there another way to ensure they stay separated?
I am running from a notebook and not the command line.

Should it be “2” or just 2?

Thanks!

What do you mean by “threads” here?

For this to work, you should start separate notebook instances, each setting a different ID to CUDA_VISIBLE_DEVICES.
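Separate instances are needed because environment variables belong to a process, not a thread: two threads in one notebook kernel share a single environment and a single CUDA context, so they cannot be pinned to different GPUs. A sketch of spawning one process per GPU from Python (the demo command just echoes the variable; replace it with your real training command, e.g. a hypothetical ["python", "train.py"]):

```python
import os
import subprocess
import sys

# Demo child command: prints the CUDA_VISIBLE_DEVICES it inherited.
# Substitute your actual training command here.
demo_cmd = [sys.executable, "-c",
            "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]

procs = []
for gpu_id in ["0", "1"]:
    env = os.environ.copy()
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = gpu_id   # a different GPU per process
    procs.append(subprocess.Popen(demo_cmd, env=env,
                                  stdout=subprocess.PIPE, text=True))

# Each child runs in parallel with its own environment.
outputs = [p.communicate()[0].strip() for p in procs]
print(outputs)  # ['0', '1']
```

The same isolation is what you get automatically by launching each notebook (or training script) from its own shell with a different CUDA_VISIBLE_DEVICES value.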