There were discussions about improving the -gpuid option, as it can be confusing, especially when using multiple GPUs. See for example:
What policy should we adopt?
My opinion is to convert this option to a boolean flag: simply -gpu, for example.
- When you have a single GPU (the most frequent use case), -gpuid already acts as a boolean.
- When you have multiple GPUs, you generally want to use CUDA_VISIBLE_DEVICES so that memory is not allocated on every GPU. Then you usually set -gpuid 1 and it also acts as a boolean.
Should we go this way? Obviously it is a breaking change, but we still allow that under semantic versioning.
Hmm, I’m not sure I like this.
Do we need to break -gpuid? It only allocates a very small amount of memory on the other GPUs. My group has been fine using -gpuid 2, etc. I know CUDA_VISIBLE_DEVICES is better, but do we need to break the current option?
Or should we drop the use of CUDA_VISIBLE_DEVICES and support a list of comma-separated identifiers in -gpuid, as I proposed in the issue?
That being said, not using CUDA_VISIBLE_DEVICES also spams the output of nvidia-smi on multi-GPU servers. Not so nice…
@jean.senellart What do you think?
the amount is not so small - it is about 250MB, which means that with 8 processes on 8 GPUs we are wasting 2GB of memory, which is too high.
I was about to say that we can hide CUDA_VISIBLE_DEVICES by using torch.setenv before loading cutorch, but there is something new, probably connected to the automatic use of THC_ALLOCATOR in Torch: the default memory grab no longer happens on recent versions of Torch. We see a single ~297MB allocation, only on the GPU we use, and only at the first memory allocation. @guillaumekln, can you double check?
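For reference, the torch.setenv idea mentioned above would look roughly like this. This is an untested sketch: torch.setenv is the call named in the comment, and the device id is an arbitrary example.

```lua
-- Sketch: hide the other GPUs from CUDA before cutorch is loaded,
-- so only the selected device gets initialized.
require('torch')
torch.setenv('CUDA_VISIBLE_DEVICES', '1')
require('cutorch')
-- From here on, cutorch only sees (and allocates on) the selected GPU.
```

The key constraint is ordering: the environment variable must be set before cutorch initializes the CUDA context, otherwise it has no effect.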
In that case, let us drop CUDA_VISIBLE_DEVICES completely, but also -nparallel, and extend the -gpuid syntax to -gpuid ID1,ID2,...,IDn. There is a little bit of work to do in Parallel.lua though.
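The parsing side of that extension is simple. A minimal sketch (the function name parseGpuid is illustrative, not the actual Parallel.lua API):

```lua
-- Split a "-gpuid ID1,ID2,...,IDn" value into a list of numeric
-- device identifiers, e.g. "1,2,4" -> {1, 2, 4}.
local function parseGpuid(str)
  local ids = {}
  for id in string.gmatch(str, '[^,]+') do
    table.insert(ids, tonumber(id))
  end
  return ids
end
```

Parallel.lua would then iterate over this list to spawn one worker per device.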
Indeed, recent Torch versions no longer allocate memory on every GPU, with or without THC_CACHING_ALLOCATOR enabled. Here was my simple test:
require('cutorch')
local t = torch.Tensor(1):cuda()
So this is good news, and we can now support a list of comma-separated identifiers with -gpuid.