I’m trying to use 8 RTX 2080 Ti GPUs but I don’t see significant performance improvements compared to 2 GPUs.
The problems start early on, when I set batch_size to 0 to get an auto-tuned value. The search takes far too long to complete a first try with a batch size of 8704, then a second try with 4863, and then it simply gets stuck on the next batch size and I have to kill the process.
So I’m setting the batch size myself and training starts, but performance is bad. With 2 GPUs I get ~20000 source tokens/sec and ~16000 target tokens/sec; with 8 GPUs I gain only about 2000 tokens/sec on top of that.
I also get this warning and a traceback when using all 8 GPUs:
WARNING:tensorflow:Large unrolled loop detected. Did you mean to use a TF loop? The following ops were created after iteration 3002: (<tf.Operation 'VarIsInitializedOp_3000/resource' type=Placeholder>, <tf.Operation 'VarIsInitializedOp_3000' type=VarIsInitializedOp>, <tf.Operation 'LogicalAnd_3000' type=LogicalAnd>)
GPU utilization is medium, ~60%. I’ve tried various combinations of software versions but results are the same:
nvidia-driver-460
OpenNMT-tf 2.15
tensorflow 2.3.2 + CUDA 10.1 + cudnn 7.6.5
tensorflow 2.4.1 + CUDA 11.0 + cudnn 8.0.4
For multi-GPU training to scale properly, peer-to-peer memory access should be enabled. In this mode data can be transferred from one GPU directly to another without going back to the CPU.
According to this NVIDIA comment, P2P for Turing GPUs requires an NVLink bridge:
They are not connected with NVLink, but I’m not sure this would solve the problem. When I built my own server with 2 RTX 2080 Ti and was considering connecting the cards with an NVLink bridge, I did some research and most benchmarks I found showed an increase of only ~10% with NVLink. So, even though there should be a penalty, I think the numbers still don’t add up for so many GPUs.
Would it be possible to try different strategies for distributed training? I will have access to this server until tomorrow so I could do some testing if needed.
This was a good occasion to re-run a quick multi-GPU benchmark on a p3.16xlarge AWS instance (8 V100).
I’m using TensorFlow 2.4.1 and OpenNMT-tf 2.15.0. I trained a TransformerBig model for a few iterations with default configuration and a batch size of 4096 per GPU (gradient accumulation is disabled). I’m reporting tokens/s (source + target tokens/s):
| | FP32 | Speedup | FP16 | Speedup |
|---|---|---|---|---|
| 1 GPU | 14.7k | | 36.6k | |
| 2 GPU | 25.8k | x 1.76 | 53.4k | x 1.48 |
| 4 GPU | 53.8k | x 3.66 | 114.2k | x 3.12 |
| 8 GPU | 105.9k | x 7.20 | 206.2k | x 5.63 |
The observed speedup is mostly as expected. Speedup appears worse for FP16 training but the 1 GPU base performance is much higher than FP32.
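For anyone who wants to redo the arithmetic, here is a small sketch that recomputes the speedup from the throughput figures in the table and also reports scaling efficiency (speedup divided by the number of GPUs). The efficiency column is an addition for illustration and was not part of the original benchmark.

```python
# Throughput in thousands of tokens/s, copied from the table above.
THROUGHPUT = {
    "FP32": {1: 14.7, 2: 25.8, 4: 53.8, 8: 105.9},
    "FP16": {1: 36.6, 2: 53.4, 4: 114.2, 8: 206.2},
}


def scaling(throughput_by_gpus):
    """Return {num_gpus: (speedup, efficiency)} relative to the 1-GPU run."""
    base = throughput_by_gpus[1]
    return {
        n: (tput / base, tput / base / n)
        for n, tput in sorted(throughput_by_gpus.items())
    }


for precision, values in THROUGHPUT.items():
    for n, (speedup, efficiency) in scaling(values).items():
        print(f"{precision} {n} GPU: x{speedup:.2f} ({efficiency:.0%} of linear)")
```

Recomputed from the rounded figures, FP16 on 2 GPUs comes out at about x 1.46 rather than the reported x 1.48; the other entries match the table.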
So OpenNMT-tf and TensorFlow are able to scale properly and the issue is most likely related to your system. I think it is important that multiple GPU pairs are interconnected for fast GPU to GPU data transfer. On the AWS instance the logs report the following:
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3 4 5 6 7
0: N Y Y Y Y N N N
1: Y N Y Y N Y N N
2: Y Y N Y N N Y N
3: Y Y Y N N N N Y
4: Y N N N N Y Y Y
5: N Y N N Y N Y Y
6: N N Y N Y Y N Y
7: N N N Y Y Y Y N
But in your logs it appears that no GPUs are interconnected (the matrix is full of “N”).
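To check this quickly on a long log, the matrix can be parsed with a few lines of Python. This is only a sketch: the helper names `parse_edge_matrix` and `isolated_gpus` are made up for this example, and it assumes each row ends with the `Y`/`N` flags as in the excerpts in this thread.

```python
import re

# Matches a matrix row such as "0: N Y Y Y", with or without a log prefix.
ROW = re.compile(r"\b(\d+):\s*((?:[YN]\s*)+)$")


def parse_edge_matrix(log_text):
    """Map each GPU index to the list of GPUs it is marked 'Y' with."""
    peers = {}
    for line in log_text.splitlines():
        match = ROW.search(line.rstrip())
        if match:
            flags = match.group(2).split()
            peers[int(match.group(1))] = [i for i, f in enumerate(flags) if f == "Y"]
    return peers


def isolated_gpus(peers):
    """GPUs whose row contains only 'N', i.e. no link was reported."""
    return [gpu for gpu, linked in sorted(peers.items()) if not linked]


# A fully connected 4-GPU matrix in the same format as the logs.
sample = """\
0: N Y Y Y
1: Y N Y Y
2: Y Y N Y
3: Y Y Y N
"""
peers = parse_edge_matrix(sample)
print(peers)
print("GPUs without any link:", isolated_gpus(peers))
```

If every GPU ends up in the `isolated_gpus` list, the log matches the "full of N" case described above.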
I think the penalty is likely to grow as you add more GPUs. I don’t have a lot of experience with this kind of setup, but I don’t think we can expect good performance on 8 GPUs without P2P enabled.
You could try training with Horovod. It offers more customization than TensorFlow but it requires a bit more work than just adding --num_gpus 8 on the command line.
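As a rough sketch of what the Horovod launch looks like on a single machine, the snippet below assembles the command line. The flags follow the OpenNMT-tf training documentation as I remember it, so please check them against your installed version; `data.yml` is a placeholder config name and `horovod_command` is just a helper made up for this example. Horovod itself has to be installed separately (with NCCL for GPU communication).

```python
import shlex


def horovod_command(num_gpus, model_type="TransformerBig", config="data.yml"):
    """Build a single-host horovodrun command line for OpenNMT-tf training."""
    return [
        "horovodrun", "-np", str(num_gpus), "-H", f"localhost:{num_gpus}",
        "onmt-main", "--model_type", model_type, "--config", config,
        "--auto_config", "train", "--horovod",  # one process per GPU
    ]


# Print the command so it can be copied to a shell, or pass the list
# to subprocess.run() to launch it from Python.
print(shlex.join(horovod_command(8)))
```

Note that with Horovod the number of GPUs is given by `-np` instead of `--num_gpus`.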
Thanks a lot for the tests and the pointer to Horovod. Indeed, with Horovod I got much better performance, ~35000 source tokens/sec and ~30000 target tokens/sec.
OpenNMT-tf 2.16 (installed in a virtual env with pip)
Tensorflow 2.4.1
python 3.8.5
The problem is that the GPUs are seriously under-utilized without Horovod, hence the terrible performance. So, now that NVLink is out of the equation, what should I look for?
Hmm, actually I installed libnccl2 later as a requirement for Horovod, and only then started training with Horovod. I will stop the current training in a few minutes, try again and report back.
I think I can reproduce the issue when using online tokenization, i.e. when setting source_tokenization and target_tokenization in the configuration. It does not go as low as 28k source+target tokens/sec in my test but I can see that the GPU usage often drops to 0%.
Can you try setting a large prefetch buffer size in the configuration, for example:
train:
  prefetch_buffer_size: 10000
The data pipeline will prepare this many batches in advance that are ready to be consumed. According to my test this fixes the performance issue.
(This parameter is not documented because by default TensorFlow can auto-tune this value, but in this case it seems we need to manually set a large value.)
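To illustrate what a prefetch buffer does, here is a conceptual sketch in plain Python: a background thread prepares items ahead of time and a bounded queue holds them until the consumer asks for the next one. This is not how TensorFlow implements it (there the equivalent is `Dataset.prefetch` in tf.data); the `prefetch` function below is a simplified stand-in written for this example and it ignores error handling in the producer.

```python
import queue
import threading


def prefetch(batches, buffer_size):
    """Yield items from `batches`, preparing roughly `buffer_size` ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the input

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()  # blocks only when the buffer is empty
        if batch is done:
            return
        yield batch


for batch in prefetch(range(5), buffer_size=2):
    print(batch)
```

A larger buffer lets the producer absorb temporary slowdowns in the input pipeline, but it cannot help if the producer is slower than the consumer on average.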
I tried your suggestion but I see no improvement. Although at the first step there is a difference and things look promising, after that performance decreases again:
Could this be a CPU or I/O bottleneck? We’ve seen this when applying too many operations on the fly with a CPU that is underpowered relative to the number of GPUs, for instance. (That was on OpenNMT-py though, not sure about OpenNMT-tf.)
May not be applicable here but thought it was worth mentioning.
@francoishernandez It’s possible this also plays a role, but I don’t think it is the main issue here because @panosk is able to get good performance when the multi-GPU training is managed by Horovod.
When on-the-fly tokenization is enabled, the execution needs to move back and forth between the TensorFlow runtime and the Python runtime where the tokenization is applied. But TensorFlow multi-GPU uses multithreading, and we know that Python code cannot be parallelized with threads because of the Global Interpreter Lock. So I think all threads are stumbling over each other when running this Python tokenization and it becomes a bottleneck. Horovod, on the other hand, starts a separate process for each GPU.
So the current possible workarounds are:
use Horovod
tokenize the data before the training (and remove the tokenization from the configuration)
I’ll open an issue in OpenNMT-tf repository since clearly there is something to improve here.
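The threads versus processes point can be tried out with a toy experiment. The sketch below times the same CPU-bound Python function with a thread pool and with a process pool; `toy_tokenize` is a made-up stand-in and not the tokenizer used by OpenNMT-tf, so this only illustrates the Global Interpreter Lock effect in general.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def toy_tokenize(line):
    """Stand-in for CPU-bound Python tokenization: cut words into 3-char pieces."""
    pieces = []
    for word in line.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


def timed_map(executor_cls, lines, workers=4):
    """Tokenize `lines` with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        # chunksize only matters for processes; threads ignore it.
        results = list(pool.map(toy_tokenize, lines, chunksize=256))
    return results, time.perf_counter() - start
```

Calling `timed_map(ThreadPoolExecutor, corpus)` and `timed_map(ProcessPoolExecutor, corpus)` on the same large list of lines gives the two timings to compare (on platforms that spawn processes, do this under an `if __name__ == "__main__":` guard). On a multi-core machine I would expect the process pool to come out ahead for pure-Python work like this, but the exact numbers depend on the machine and on the cost of sending data between processes.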
Not sure if this thread would be the most appropriate for this post, but I’m experiencing a very similar issue, just without online tokenisation.
We have recently upgraded to new machines with 4 Quadro RTX 6000 instead of 2, but throughput increased by barely 2000 tokens/sec with the same data and the same model configuration (just with num_gpus=4). More concretely, with a batch size of 4096 tokens, I get around 29k tokens/sec with 2 GPUs, while with 4 GPUs I get 31k tokens/sec.
Like @panosk, I’m also training a TransformerBig model with shared embeddings.
The GPUs seem to be correctly installed and recognised:
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3
0: N Y Y Y
1: Y N Y Y
2: Y Y N Y
3: Y Y Y N
However, the GPUs seem to run at only about half capacity. Looking at the power consumption, I can see this (it’s the same for all GPUs): 132W / 260W
while in a 2-GPU machine, with the same data and configuration, values are around 231W / 260W and the volatile GPU utilization is also higher (around 90-95%):