-world_size 2 and -gpu_ranks 0 1 lead to two processes per GPU

This question is mainly to check whether I am doing something wrong. I am running a command on multiple GPUs, like this:

CUDA_VISIBLE_DEVICES=2,3 ... -world_size 2 -gpu_ranks 0 1

At first this launches one process for each GPU up until the training loop actually starts.

|    2     27032      C   ...nvs/nfr-experiments-3R5lX5O6/bin/python  1435MiB |
|    3     27033      C   ...nvs/nfr-experiments-3R5lX5O6/bin/python  1435MiB |

After that, I can see that there are two processes for each GPU:

|    2     25484      C   ...nvs/nfr-experiments-3R5lX5O6/bin/python 11341MiB |
|    2     25486      C   ...nvs/nfr-experiments-3R5lX5O6/bin/python  1065MiB |
|    3     25485      C   ...nvs/nfr-experiments-3R5lX5O6/bin/python 10321MiB |
|    3     25486      C   ...nvs/nfr-experiments-3R5lX5O6/bin/python  1065MiB |

This does not happen when using only one GPU, in which case only one process is created. I have built custom NLP models with DistributedDataParallel in the past and never encountered this behaviour, so I am wondering whether I am doing something wrong with the commands.

There is also a speed difference where a single GPU is faster than using two GPUs, although I am not sure how to read the tok/s number.

  • One process: Step 250/1000000; acc: 8.07; ppl: 1620.20; xent: 7.39; lr: 0.00003; 15592/11680 tok/s; 96 sec
  • Two processes: Step 250/1000000; acc: 8.02; ppl: 1382.01; xent: 7.23; lr: 0.00003; 24310/20151 tok/s; 112 sec

It’s expected to have two processes per GPU: one for the “batch producer” and one for the “true” training process.
Basically, the batch producer lives in the main process, which loads all the data, and sends batches to each GPU process via a multiprocessing.Queue.

E.g. in your example, pid 25486 which is on each GPU is in fact the batch producer.
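
To make the pattern concrete, here is a minimal sketch of a batch producer feeding one queue per training process. It is not the actual OpenNMT code: the names produce, consume and SENTINEL are hypothetical, the round-robin dispatch is an assumption for illustration, and the "training" process only collects the batches it receives.

```python
import multiprocessing as mp

SENTINEL = None  # hypothetical end-of-data marker


def produce(batches, queues):
    """Batch producer: hand batches to the per-GPU queues in round-robin order."""
    for i, batch in enumerate(batches):
        queues[i % len(queues)].put(batch)
    # Tell every consumer that no more data is coming.
    for q in queues:
        q.put(SENTINEL)


def consume(rank, q, results):
    """Stand-in for one training process: read batches until the sentinel arrives."""
    seen = []
    while True:
        batch = q.get()
        if batch is SENTINEL:
            break
        seen.append(batch)  # a real trainer would run a training step here
    results.put((rank, seen))


if __name__ == "__main__":
    world_size = 2
    queues = [mp.Queue(maxsize=4) for _ in range(world_size)]
    results = mp.Queue()
    workers = [
        mp.Process(target=consume, args=(rank, queues[rank], results))
        for rank in range(world_size)
    ]
    for w in workers:
        w.start()
    # The main process plays the role of the batch producer.
    produce(range(6), queues)
    # Read the results before joining so the workers can flush their queues.
    collected = dict(results.get() for _ in workers)
    for w in workers:
        w.join()
    print(collected)
```

The bounded queues (maxsize=4 here, an arbitrary value) keep the producer from running arbitrarily far ahead of the training processes.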

See code in
Was introduced in

As for the performance gap, multi GPU does not necessarily scale linearly.

Makes sense, though I am surprised by the performance. I am aware that multi GPU does not scale linearly, but in my case one GPU is faster than two GPUs. That should never be the case, right? I often work with queues at the CPU level and that works well, although the difference between one core loading data directly (no separate queue) and two CPUs with a queue is often negligible, or even negative as in this case. When using a single process, is all data read into (CPU) memory before training? Maybe that is the issue: reading everything into memory, or simply not needing a queue’s put and get calls, may already save a lot of time. I wonder whether the queue actually adds anything compared to letting PyTorch handle the data loading?

With two GPUs you’re handling twice as much data per step. So you should compare wall times for one process / step 500 vs two processes / step 250.
The tok/s values are quite convenient for such comparison, as you can compare the data “flow rate” of your training. Here, it means you’re handling 15.6k/11.7k tokens (source/target) per second in single GPU, vs 24.3k/20.2k in dual GPU, hence the latter is faster (in terms of data seen, not steps performed).
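
As a quick check of that arithmetic, the sketch below plugs in the numbers from the two log lines above. The variable names are made up for illustration, and the 500-step estimate assumes the single-GPU time per step stays roughly constant.

```python
# tok/s values (source, target) taken from the two log lines in the question
single_src, single_tgt = 15592, 11680
dual_src, dual_tgt = 24310, 20151

src_speedup = dual_src / single_src
tgt_speedup = dual_tgt / single_tgt
print(f"throughput: source {src_speedup:.2f}x, target {tgt_speedup:.2f}x")
# prints "throughput: source 1.56x, target 1.73x"

# Equal amount of data: 250 steps on two GPUs vs 500 steps on one GPU.
single_secs_250 = 96   # single GPU, wall time at step 250
dual_secs_250 = 112    # two GPUs, wall time at step 250
single_secs_500 = single_secs_250 * 500 / 250  # estimate, assumes constant step time
time_ratio = single_secs_500 / dual_secs_250
print(f"same data: ~{single_secs_500:.0f} s on one GPU vs {dual_secs_250} s on two ({time_ratio:.2f}x)")
# prints "same data: ~192 s on one GPU vs 112 s on two (1.71x)"
```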

That clarifies things, thanks!!