Cannot scale well with multiple GPUs

panosk · February 18, 2021, 8:16am

I’m trying to use 8 RTX 2080 Ti GPUs but I don’t see significant performance improvements compared to 2 GPUs.

The problems start early on, when I set batch_size to 0 so I can get an auto-tuned value. The process takes way too long to complete a first try with 8704 batch size, then a second try with 4863, and then it simply gets stuck with the next batch size and I have to kill the process.

So I’m setting the batch size myself and training starts, but performance is bad. With 2 GPUs I get ~20000 source tokens/sec and ~16000 target tokens/sec; with 8 GPUs I gain only about 2000 tokens/sec on top of that.

I also get this warning and a traceback when using all 8 GPUs:

WARNING:tensorflow:Large unrolled loop detected. Did you mean to use a TF loop? The following ops were created after iteration 3002: (<tf.Operation 'VarIsInitializedOp_3000/resource' type=Placeholder>, <tf.Operation 'VarIsInitializedOp_3000' type=VarIsInitializedOp>, <tf.Operation 'LogicalAnd_3000' type=LogicalAnd>)

GPU utilization is medium, ~60%. I’ve tried various combinations of software versions but results are the same:

nvidia-driver-460
OpenNMT-tf 2.15

tensorflow 2.3.2 + CUDA 10.1 + cudnn 7.6.5
tensorflow 2.4.1 + CUDA 11.0 + cudnn 8.0.4

guillaumekln · February 18, 2021, 8:31am

Can you post the full training logs?

panosk · February 18, 2021, 8:59am

Sent them in pm, thanks.

guillaumekln · February 18, 2021, 9:26am

Are the GPUs interconnected with NVLink?

For multi-GPU training to scale properly, peer-to-peer memory access should be enabled. In this mode data can be transferred from one GPU directly to another without going back to the CPU.

According to this NVIDIA comment, P2P for Turing GPUs requires an NVLink bridge:

panosk · February 18, 2021, 9:51am

They are not connected with NVLink, but I’m not sure this would solve the problem. When I built my own server with 2 RTX 2080 Ti and was considering connecting the cards with an NVLink bridge, I did some research and most benchmarks I found showed an increase of only ~10% with NVLink. So, even though there should be a penalty, I think the numbers still don’t add up for so many GPUs.

panosk · February 18, 2021, 10:34am

Would it be possible to try different strategies for distributed training? I will have access to this server until tomorrow so I could do some testing if needed.

guillaumekln · February 18, 2021, 11:52am

This was a good occasion to re-run a quick multi-GPU benchmark on a p3.16xlarge AWS instance (8 V100).

I’m using TensorFlow 2.4.1 and OpenNMT-tf 2.15.0. I trained a TransformerBig model for a few iterations with default configuration and a batch size of 4096 per GPU (gradient accumulation is disabled). I’m reporting tokens/s (source + target tokens/s):

	FP32	Speedup	FP16	Speedup
1 GPU	14.7k		36.6k
2 GPU	25.8k	x 1.76	53.4k	x 1.48
4 GPU	53.8k	x 3.66	114.2k	x 3.12
8 GPU	105.9k	x 7.20	206.2k	x 5.63

The observed speedup is mostly as expected. Speedup appears worse for FP16 training but the 1 GPU base performance is much higher than FP32.
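For reference, the speedup column is the N-GPU throughput divided by the 1-GPU throughput, and dividing that again by N gives a scaling efficiency. Here is a minimal Python sketch using the FP32 numbers from the table above; the name `scaling_report` is made up for this illustration.

```python
# Throughput in tokens/s (source + target), FP32 column of the table above.
fp32_tokens_per_sec = {1: 14_700, 2: 25_800, 4: 53_800, 8: 105_900}


def scaling_report(throughput):
    """Return {num_gpus: (speedup, efficiency)} relative to the 1-GPU run."""
    base = throughput[1]
    report = {}
    for num_gpus, tokens in sorted(throughput.items()):
        speedup = tokens / base
        # Efficiency is the fraction of ideal linear scaling that was achieved.
        report[num_gpus] = (speedup, speedup / num_gpus)
    return report


for num_gpus, (speedup, efficiency) in scaling_report(fp32_tokens_per_sec).items():
    print(f"{num_gpus} GPU: x{speedup:.2f} speedup, {efficiency:.0%} efficiency")
```

An efficiency close to 100% means near-linear scaling. A setup where adding GPUs barely changes the total tokens/s would show the efficiency dropping sharply as N grows.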

So OpenNMT-tf and TensorFlow are able to scale properly and the issue is most likely related to your system. I think it is important that multiple GPU pairs are interconnected for fast GPU to GPU data transfer. On the AWS instance the logs report the following:

Device interconnect StreamExecutor with strength 1 edge matrix:
     0 1 2 3 4 5 6 7
0:   N Y Y Y Y N N N
1:   Y N Y Y N Y N N
2:   Y Y N Y N N Y N
3:   Y Y Y N N N N Y
4:   Y N N N N Y Y Y
5:   N Y N N Y N Y Y
6:   N N Y N Y Y N Y
7:   N N N Y Y Y Y N

But in your logs it appears that no GPUs are interconnected (the matrix is full of “N”).
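To check this quickly on another machine, the matrix can be parsed from the log and the connected pairs counted. This is only a rough sketch: the helper names are hypothetical, and it assumes the rows look exactly as pasted here (real TensorFlow log lines usually carry a timestamp prefix that would need to be stripped first).

```python
def parse_interconnect_matrix(log_text):
    """Parse rows such as '0:   N Y Y Y' into lists of booleans."""
    matrix = []
    for line in log_text.splitlines():
        head, sep, cells = line.strip().partition(":")
        # Keep only rows whose label before the colon is a GPU index.
        if sep and head.isdigit():
            matrix.append([cell == "Y" for cell in cells.split()])
    return matrix


def count_p2p_pairs(matrix):
    """Count unordered GPU pairs that report peer access in both directions."""
    return sum(
        1
        for i, row in enumerate(matrix)
        for j in range(i + 1, len(row))
        if row[j] and j < len(matrix) and matrix[j][i]
    )


# The matrix reported on the 8 x V100 AWS instance above.
aws_log = """Device interconnect StreamExecutor with strength 1 edge matrix:
     0 1 2 3 4 5 6 7
0:   N Y Y Y Y N N N
1:   Y N Y Y N Y N N
2:   Y Y N Y N N Y N
3:   Y Y Y N N N N Y
4:   Y N N N N Y Y Y
5:   N Y N N Y N Y Y
6:   N N Y N Y Y N Y
7:   N N N Y Y Y Y N
"""

matrix = parse_interconnect_matrix(aws_log)
print(count_p2p_pairs(matrix), "GPU pairs with peer-to-peer access")
```

A result of 0 corresponds to a matrix that contains only “N”.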

I think the penalty is likely to grow as you add more GPUs. I don’t have a lot of experience with this kind of setup, but I don’t think we can expect good performance on 8 GPUs without P2P enabled.

You could try training with Horovod. It offers more customization than TensorFlow but it requires a bit more work than just adding --num_gpus 8 on the command line.

panosk · February 18, 2021, 2:20pm

Thanks a lot for the tests and the pointer to horovod. Indeed, with horovod I got much better performance, ~35000 source tokens/sec and ~30000 target tokens/sec.

panosk · February 25, 2021, 7:50pm

Hi @guillaumekln ,

I’m coming back to this topic…

Now I’m using a server with 4 V100 GPUs connected with NVLink:

Device interconnect StreamExecutor with strength 1 edge matrix:
      0 1 2 3 
 0:   N Y Y Y
 1:   Y N Y Y 
 2:   Y Y N Y

Scaling without horovod is impossible, and I get abysmal performance. Using horovod I get the numbers you report: ~120k source+target tokens/sec.

I’m training a TransformerBig model with shared embeddings:

import tensorflow as tf
import opennmt as onmt


def model():
    return onmt.models.Transformer(
        source_inputter=onmt.inputters.WordEmbedder(embedding_size=1024),
        target_inputter=onmt.inputters.WordEmbedder(embedding_size=1024),
        num_layers=6,
        num_units=1024,
        num_heads=16,
        ffn_inner_dim=4096,
        dropout=0.1,
        attention_dropout=0.1,
        ffn_dropout=0.1,
        share_embeddings=onmt.models.EmbeddingsSharingLevel.ALL,
    )

Specs:

Ubuntu 20.04
cuda 11.0
cudnn 8.0.5
nvidia-driver-450
OpenNMT-tf 2.16 (installed in a virtual env with pip)
Tensorflow 2.4.1
python 3.8.5

The problem is the GPUs are seriously under-utilized without horovod and hence the terrible performance. So, now that NVLink is out of the equation, what should I look for?

Thanks!

guillaumekln · February 25, 2021, 8:14pm

Are you installing all packages yourself or are you using a Docker image? If installing manually, can you check if libnccl2 is installed?

panosk · February 25, 2021, 8:24pm

Hmm, actually I installed libnccl2 later as a requirement for horovod, then I started training with horovod. I will stop the current training in a few minutes, try again and report back.

panosk · February 25, 2021, 9:03pm

Nope, same problem. Performance is 28k source+target tokens/sec with the GPUs dropping as low as 0% utilization.

guillaumekln · February 26, 2021, 11:51am

I think I can reproduce the issue when using online tokenization, i.e. when setting source_tokenization and target_tokenization in the configuration. It does not go as low as 28k source+target tokens/sec in my test but I can see that the GPU usage often drops to 0%.

Can you try setting a large prefetch buffer size in the configuration, for example:

train:
  prefetch_buffer_size: 10000

The data pipeline will prepare this many batches in advance so that they are ready to be consumed. According to my test this fixes the performance issue.

(This parameter is not documented because by default TensorFlow can auto-tune this value, but in this case it seems we need to manually set a large value.)
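To illustrate what a prefetch buffer does conceptually, here is a plain-Python sketch. It is not how tf.data or OpenNMT-tf implement prefetching, and the function name `prefetch` is just chosen for this example: a background thread fills a bounded queue while the consumer reads from it.

```python
import queue
import threading


def prefetch(iterable, buffer_size):
    """Yield items from `iterable`, preparing up to `buffer_size` of them in advance."""
    buffer = queue.Queue(maxsize=buffer_size)
    end = object()  # sentinel marking the end of the input

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(end)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks only if the producer has fallen behind
        if item is end:
            return
        yield item


batches = (f"batch-{i}" for i in range(5))  # stand-in for a slow data pipeline
for batch in prefetch(batches, buffer_size=2):
    print(batch)
```

A larger buffer lets the producer run further ahead, which smooths out occasional slow batches. It cannot help if the producer is slower than the consumer on average.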

panosk · February 26, 2021, 12:48pm

I tried your suggestion but I see no improvement. Although at the first step there is a difference and things look promising, after that performance decreases again:

INFO:tensorflow:Step = 105100 ; steps/s = 1.73, source words/s = 57549, target words/s = 46226 ; Learning rate = 0.000193 ; Loss = 2.362551
INFO:tensorflow:Step = 105200 ; steps/s = 0.84, source words/s = 27941, target words/s = 21942 ; Learning rate = 0.000193 ; Loss = 2.492122
INFO:tensorflow:Step = 105300 ; steps/s = 0.46, source words/s = 15349, target words/s = 11928 ; Learning rate = 0.000193 ; Loss = 2.420474
INFO:tensorflow:Step = 105400 ; steps/s = 0.46, source words/s = 15374, target words/s = 11964 ; Learning rate = 0.000193 ; Loss = 2.493598
INFO:tensorflow:Step = 105500 ; steps/s = 0.46, source words/s = 15275, target words/s = 11868 ; Learning rate = 0.000192 ; Loss = 2.374078
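To compare these lines with the source+target figures quoted earlier in the thread, the two per-side values can be summed. A small sketch, assuming the exact log format shown above (the pattern and function name are mine, not part of OpenNMT-tf):

```python
import re

# Matches the fields of the training log lines shown above.
LOG_PATTERN = re.compile(
    r"Step = (\d+) ; steps/s = ([\d.]+), "
    r"source words/s = (\d+), target words/s = (\d+)"
)


def total_words_per_sec(log_line):
    """Return (step, source + target words/s) for a training log line, or None."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    step, _, source, target = match.groups()
    return int(step), int(source) + int(target)


line = (
    "INFO:tensorflow:Step = 105300 ; steps/s = 0.46, source words/s = 15349, "
    "target words/s = 11928 ; Learning rate = 0.000193 ; Loss = 2.420474"
)
print(total_words_per_sec(line))
```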

francoishernandez · February 26, 2021, 1:36pm

Could this be a CPU or I/O bottleneck? We’ve seen this when applying too many operations on the fly, for instance with a CPU that is underpowered relative to the number of GPUs. (That was on OpenNMT-py though, not sure about OpenNMT-tf.)
It may not be applicable here, but I thought it was worth mentioning.

guillaumekln · February 26, 2021, 2:41pm

@francoishernandez It’s possible this also plays a role, but I don’t think it is the main issue here because @panosk is able to get good performance when the multi-GPU training is managed by Horovod.

When enabling on-the-fly tokenization, the execution needs to move back and forth between the TensorFlow runtime and the Python runtime where the tokenization is applied. But TensorFlow multi-GPU uses multithreading, and we know that Python code cannot be parallelized with threads because of the Global Interpreter Lock. So I think all threads are stumbling over each other when running this Python tokenization and it becomes a bottleneck. On the other hand, Horovod starts a separate process for each GPU.
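The GIL effect is easy to observe in isolation. The sketch below is not OpenNMT-tf code: `tokenize` is a made-up, pure-Python stand-in for CPU-bound work, used only to show that adding threads does not speed up Python code that holds the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def tokenize(line):
    """Stand-in for a pure-Python tokenizer: CPU-bound work that holds the GIL."""
    tokens = []
    for word in line.split():
        # Split each word into chunks of at most 3 characters.
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens


def tokenize_all(lines, num_threads):
    """Tokenize `lines` with `num_threads` threads; return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(tokenize, lines))
    return results, time.perf_counter() - start


lines = ["multithreaded tokenization example sentence"] * 50_000
for num_threads in (1, 4):
    _, seconds = tokenize_all(lines, num_threads)
    print(f"{num_threads} thread(s): {seconds:.2f}s")
```

On a standard (GIL-enabled) CPython build, the 4-thread run is typically no faster than the 1-thread run. Replacing `ThreadPoolExecutor` with `ProcessPoolExecutor` gives each worker its own interpreter, which is loosely analogous to Horovod running one process per GPU.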

So the current possible workarounds are:

use Horovod
tokenize the data before the training (and remove the tokenization from the configuration)

I’ll open an issue in the OpenNMT-tf repository since clearly there is something to improve here.

panosk · February 26, 2021, 5:58pm

Hi @francoishernandez ,

This could be a factor, but I don’t think it is in my case: the machines I’m using are pretty beefy, with Xeon processors, NVMe storage, and plenty of RAM.

@guillaumekln , I’ll keep using Horovod, but it’s nice we found the reason, thanks!

guillaumekln · October 11, 2021, 2:20pm

For reference, the performance issue related to the on-the-fly OpenNMT tokenization is fixed (or largely improved) with this change:

dmarin · October 29, 2021, 5:26pm

Hi everyone,

Not sure if this thread would be the most appropriate for this post, but I’m experiencing a very similar issue, just without online tokenisation.

We have recently upgraded to new machines with 4 Quadro RTX 6000 instead of 2, but the performance increased by barely 2000 tokens/sec with the same data and the same model configuration (just with num_gpus=4). More concretely, with a batch size of 4096 tokens, I get around 29k tokens/sec with 2 GPUs, while with 4 GPUs I get 31k tokens/sec.

Like @panosk, I’m also training a TransformerBig model with shared embeddings.

It seems GPUs are well installed and recognised:

Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
     0 1 2 3 
0:   N Y Y Y 
1:   Y N Y Y 
2:   Y Y N Y 
3:   Y Y Y N

However, the GPUs seem to run at only about half capacity. Looking at the power consumption, I can see this (it’s the same for all GPUs): 132W / 260W

| 59%   77C    P2   132W / 260W |  23410MiB / 24220MiB |     66%      Default |

while in a 2-GPU machine, with the same data and configuration, values are around 231W / 260W and the volatile GPU utilization is also higher (around 90-95%):

| 53%   74C    P2   231W / 260W |  23450MiB / 24220MiB |     93%      Default |

The 4-GPU machines have the same specifications as the 2-GPU ones. Versions are the following:

OpenNMT-tf 2.22.0
TensorFlow 2.4.3 + CUDA 11.0

Do you know anything I could try to pinpoint the cause of this issue?

guillaumekln · October 30, 2021, 8:16am

Hi,

There are multiple things you can try and see how it changes your performance numbers:

Increase the batch size
Enable mixed precision
Run multi GPU training with Horovod

For completeness, can you also post your training configuration?