Cannot scale well with multiple GPUs

Can you post the full training logs?

Sent them in a PM, thanks.

Are the GPUs interconnected with NVLink?

For multi-GPU training to scale properly, peer-to-peer (P2P) memory access should be enabled. In this mode, data can be transferred directly from one GPU to another without going through the CPU.
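
As a side note, here is a minimal TensorFlow-level sketch (separate from OpenNMT-tf, which sets all of this up for you) of where this comes into play: MirroredStrategy reduces the gradients across GPUs, by default with NCCL, which benefits from P2P/NVLink. The reduction implementation can be swapped for comparison:

import tensorflow as tf

# TensorFlow prints the "Device interconnect ... edge matrix" shown further
# below when the GPUs are initialized, e.g. when the strategy is created.
print(tf.config.list_physical_devices("GPU"))

# Default multi-GPU strategy: gradients are reduced with NCCL, which takes
# advantage of P2P/NVLink between the GPUs when available.
strategy = tf.distribute.MirroredStrategy()

# For comparison, the reduction can instead be gathered on a single device and
# broadcast back; this does not rely on P2P but adds extra copies.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.ReductionToOneDevice()
)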

According to this NVIDIA comment, P2P for Turing GPUs requires an NVLink bridge:

They are not connected with NVLink, but I’m not sure that would solve the problem. When I built my own server with 2 RTX 2080 Ti and was considering connecting the cards with an NVLink bridge, I did some research and most benchmarks I found showed an increase of only ~10% with NVLink. So, even though there should be a penalty, I think the numbers still don’t add up for so many GPUs.

Would it be possible to try different strategies for distributed training? I will have access to this server until tomorrow so I could do some testing if needed.

This was a good occasion to re-run a quick multi-GPU benchmark on a p3.16xlarge AWS instance (8 V100).

I’m using TensorFlow 2.4.1 and OpenNMT-tf 2.15.0. I trained a TransformerBig model for a few iterations with the default configuration and a batch size of 4096 tokens per GPU (gradient accumulation is disabled). I’m reporting source + target tokens/s:

         FP32     Speedup   FP16     Speedup
1 GPU    14.7k    -         36.6k    -
2 GPUs   25.8k    x 1.76    53.4k    x 1.48
4 GPUs   53.8k    x 3.66    114.2k   x 3.12
8 GPUs   105.9k   x 7.20    206.2k   x 5.63

The observed speedup is mostly as expected. The speedup appears worse for FP16 training, but that is because the single-GPU baseline is already much faster than in FP32.

So OpenNMT-tf and TensorFlow are able to scale properly, and the issue is most likely related to your system. I think it is important that the GPUs are interconnected for fast GPU-to-GPU data transfer. On the AWS instance the logs report the following:

Device interconnect StreamExecutor with strength 1 edge matrix:
     0 1 2 3 4 5 6 7
0:   N Y Y Y Y N N N
1:   Y N Y Y N Y N N
2:   Y Y N Y N N Y N
3:   Y Y Y N N N N Y
4:   Y N N N N Y Y Y
5:   N Y N N Y N Y Y
6:   N N Y N Y Y N Y
7:   N N N Y Y Y Y N

But in your logs it appears that no GPUs are interconnected (the matrix is full of “N”).

I think the penalty is likely to grow as you add more GPUs. I don’t have a lot of experience with this kind of setup, but I don’t think we can expect good performance on 8 GPUs without P2P enabled.

You could try training with Horovod. It offers more customization than TensorFlow’s built-in distribution, but it requires a bit more work than just adding --num_gpus 8 on the command line.
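
For reference, this is roughly the pattern Horovod expects in a TF2 training loop: one process per GPU, each process pinned to its own device, and the gradients averaged with an allreduce. This is just a generic sketch of the Horovod API with a toy model, not the actual OpenNMT-tf integration:

import horovod.tensorflow as hvd
import tensorflow as tf

# One process per GPU: initialize Horovod and pin this process to its own GPU.
hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Toy model and optimizer, just to show the training-step pattern.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(16,)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(features, labels, first_batch):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(features, training=True) - labels))
    # Average the gradients across all Horovod processes before applying them.
    tape = hvd.DistributedGradientTape(tape)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    if first_batch:
        # Start all workers from the same weights and optimizer state.
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(optimizer.variables(), root_rank=0)
    return loss

for step in range(10):
    train_step(tf.random.normal([32, 16]), tf.random.normal([32, 1]), step == 0)

Each copy of the script is then launched by Horovod (for example with horovodrun -np 8), one process per GPU.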


Thanks a lot for the tests and the pointer to Horovod. Indeed, with Horovod I got much better performance: ~35,000 source tokens/sec and ~30,000 target tokens/sec.


Hi @guillaumekln ,

I’m coming back to this topic…

Now I’m using a server with 4 V100 GPUs connected with NVLink:

Device interconnect StreamExecutor with strength 1 edge matrix:
      0 1 2 3 
 0:   N Y Y Y
 1:   Y N Y Y 
 2:   Y Y N Y 
 3:   Y Y Y N 

Scaling without Horovod is basically impossible and I get abysmal performance. Using Horovod I get the numbers you report: ~120k source+target tokens/s.

I’m training a TransformerBig model with shared embeddings:

import tensorflow as tf
import opennmt as onmt


def model():
    return onmt.models.Transformer(
        source_inputter=onmt.inputters.WordEmbedder(embedding_size=1024),
        target_inputter=onmt.inputters.WordEmbedder(embedding_size=1024),
        num_layers=6,
        num_units=1024,
        num_heads=16,
        ffn_inner_dim=4096,
        dropout=0.1,
        attention_dropout=0.1,
        ffn_dropout=0.1,
        share_embeddings=onmt.models.EmbeddingsSharingLevel.ALL,
    )

Specs:

  • Ubuntu 20.04
  • CUDA 11.0
  • cuDNN 8.0.5
  • nvidia-driver-450
  • OpenNMT-tf 2.16 (installed in a virtual env with pip)
  • TensorFlow 2.4.1
  • Python 3.8.5

The problem is that the GPUs are seriously under-utilized without Horovod, hence the terrible performance. So, now that NVLink is out of the equation, what should I look for?

Thanks!

Are you installing all packages yourself or are you using a Docker image? If installing manually, can you check if libnccl2 is installed?

Hmm, actually I installed libnccl2 later as a requirement for Horovod, and then I started training with Horovod. I will stop the current training in a few minutes, try again, and report back.

Nope, same problem. Performance is 28k source+target tokens/sec with the GPUs dropping as low as 0% utilization.

I think I can reproduce the issue when using online tokenization, i.e. when setting source_tokenization and target_tokenization in the configuration. It does not go as low as 28k source+target tokens/sec in my test but I can see that the GPU usage often drops to 0%.

Can you try setting a large prefetch buffer size in the configuration, for example:

train:
  prefetch_buffer_size: 10000

The data pipeline will prepare this many batches in advance so they are ready to be consumed. According to my test this fixes the performance issue.

(This parameter is not documented because by default TensorFlow can auto-tune this value, but in this case it seems we need to manually set a large value.)
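
Under the hood this corresponds to standard tf.data prefetching. Here is a minimal sketch of the mechanism (not the actual OpenNMT-tf pipeline):

import tensorflow as tf

# Toy pipeline: the map() stands in for on-the-fly preprocessing.
dataset = tf.data.Dataset.range(1000000)
dataset = dataset.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(64)

# prefetch() prepares up to buffer_size batches ahead of time so that the GPUs
# do not wait on the input pipeline. tf.data.AUTOTUNE would normally pick this
# value automatically; prefetch_buffer_size sets it explicitly.
dataset = dataset.prefetch(buffer_size=10000)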

I tried your suggestion but I see no improvement. Although there is a difference at first and things look promising, performance decreases again after that:

INFO:tensorflow:Step = 105100 ; steps/s = 1.73, source words/s = 57549, target words/s = 46226 ; Learning rate = 0.000193 ; Loss = 2.362551
INFO:tensorflow:Step = 105200 ; steps/s = 0.84, source words/s = 27941, target words/s = 21942 ; Learning rate = 0.000193 ; Loss = 2.492122
INFO:tensorflow:Step = 105300 ; steps/s = 0.46, source words/s = 15349, target words/s = 11928 ; Learning rate = 0.000193 ; Loss = 2.420474
INFO:tensorflow:Step = 105400 ; steps/s = 0.46, source words/s = 15374, target words/s = 11964 ; Learning rate = 0.000193 ; Loss = 2.493598
INFO:tensorflow:Step = 105500 ; steps/s = 0.46, source words/s = 15275, target words/s = 11868 ; Learning rate = 0.000192 ; Loss = 2.374078

Could this be a CPU or I/O bottleneck? We’ve seen this happen when applying too many operations on the fly, for instance with an underpowered CPU relative to the number of GPUs (on OpenNMT-py though, not sure about OpenNMT-tf).
It may not be applicable here, but I thought it was worth mentioning.

@francoishernandez It’s possible this also plays a role, but I don’t think it is the main issue here because @panosk is able to get good performance when the multi-GPU training is managed by Horovod.

When enabling on-the-fly tokenization, the execution needs to move back and forth between the TensorFlow runtime and the Python runtime where the tokenization is applied. But TensorFlow multi-GPU uses multithreading, and Python code cannot be parallelized with threads because of the Global Interpreter Lock (GIL). So I think all threads are stumbling over each other when running this Python tokenization and it becomes a bottleneck. Horovod, on the other hand, starts a separate process for each GPU.
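
Schematically, the on-the-fly tokenization ends up in the input pipeline as something like this (a simplified sketch, not the actual OpenNMT-tf code). Every call into the Python function has to hold the GIL, so the parallel input threads end up waiting on each other:

import tensorflow as tf

def tokenize_fn(text):
    # Stand-in for the Python tokenizer: it runs in the Python interpreter and
    # therefore needs to hold the GIL.
    return tf.constant(text.numpy().decode("utf-8").split())

dataset = tf.data.Dataset.from_tensor_slices(["hello world", "multi gpu training"])
# tf.py_function drops back into Python for every element, so the parallel
# input threads contend on the GIL instead of running concurrently.
dataset = dataset.map(
    lambda text: tf.py_function(tokenize_fn, [text], tf.string),
    num_parallel_calls=tf.data.AUTOTUNE,
)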

So the current possible workarounds are:

  • use Horovod
  • tokenize the data before the training and remove the tokenization from the configuration (see the sketch below)
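
For the second option, a minimal pre-tokenization sketch with pyonmttok could look like the following (the tokenization options and file names here are only placeholders; reuse the exact options from your training configuration):

import pyonmttok

# Placeholder options: use the same tokenization options as in the training
# configuration so the offline output matches the on-the-fly one.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Placeholder file names.
with open("train.src") as infile, open("train.src.tok", "w") as outfile:
    for line in infile:
        tokens, _ = tokenizer.tokenize(line.strip())
        outfile.write(" ".join(tokens) + "\n")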

I’ll open an issue in the OpenNMT-tf repository since there is clearly something to improve here.


Hi @francoishernandez ,

This could be a factor, but I don’t think it is in my case: the machines I’m using are pretty beefy, with Xeon processors, NVMe storage, and plenty of RAM.

@guillaumekln, I’ll keep using Horovod, but it’s nice that we found the reason, thanks!

For reference, the performance issue related to the on-the-fly OpenNMT tokenization is fixed (or largely improved) with this change:


Hi everyone,

Not sure if this thread is the most appropriate place for this post, but I’m experiencing a very similar issue, just without online tokenisation.

We have recently upgraded to new machines with 4 Quadro RTX 6000 instead of 2, but the performance barely increased by 2,000 tokens/sec with the same data and the same model configuration (just with num_gpus=4). More concretely, with a batch size of 4096 tokens, I get around 29k tokens/sec with 2 GPUs, while with 4 GPUs I get 31k tokens/sec.

Like @panosk, I’m also training a TransformerBig model with shared embeddings.

It seems the GPUs are properly installed and recognised:

Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
     0 1 2 3 
0:   N Y Y Y 
1:   Y N Y Y 
2:   Y Y N Y 
3:   Y Y Y N 

However, the GPUs only seem to be running at about half capacity. Looking at the power consumption, I can see this (it is the same for all GPUs): 132W / 260W

| 59%   77C    P2   132W / 260W |  23410MiB / 24220MiB |     66%      Default |

while on a 2-GPU machine, with the same data and configuration, the values are around 231W / 260W and the volatile GPU utilization is also higher (around 90-95%):

| 53%   74C    P2   231W / 260W |  23450MiB / 24220MiB |     93%      Default | 

The 4-GPU machines have the same specifications as the 2-GPU ones. Versions are the following:

  • OpenNMT-tf 2.22.0
  • TensorFlow 2.4.3 + CUDA 11.0

Do you know of anything I could try to pinpoint the cause of this issue?

Hi,

There are multiple things you can try to see how they change your performance numbers:

  • Increase the batch size
  • Enable mixed precision (see the sketch below)
  • Run multi-GPU training with Horovod
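
On the mixed precision point, OpenNMT-tf configures this for you when the option is enabled; underneath it relies on the standard TensorFlow policy, shown here only to illustrate the mechanism:

import tensorflow as tf

# Mixed precision at the TensorFlow level: compute in float16 while keeping the
# variables in float32. OpenNMT-tf sets this policy when mixed precision is
# enabled, so this is only for illustration.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
print(tf.keras.mixed_precision.global_policy())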

For completeness, can you also post your training configuration?

Thanks a lot for the hints, Guillaume.

Regarding the batch size, I usually use 4096 tokens (as with my older machines). If I increase it, I get OOM errors. If I decrease it, performance is impacted accordingly, so nothing out of the ordinary here, I think.

Interestingly, I get an error if I try to autotune the batch size. I thought this was related to my machine/configuration, so I didn’t create a GitHub issue, but let me know if it would be useful. The error is as follows:

INFO:tensorflow:... failed.
INFO:tensorflow:Trying training with batch size 4287...
INFO:tensorflow:... failed.

[...]

ERROR:tensorflow:Last training attempt exited with an error:

Traceback (most recent call last):
load_model_module

[...]

    raise ValueError("Model configuration not found in %s" % path)
ValueError: Model configuration not found in [local_path]/run/model_description.py

Regarding mixed precision, I always have it enabled. When you shared your last benchmarks, I was able to reproduce the figures with and without mixed precision on my older machines (2 GPUs). On my new machines, mixed precision seems to work fine too. If I disable it, performance drops significantly (by 40% more or less). The related check seems OK:

INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0

Regarding Horovod, my implementation/environment makes it a bit difficult to use, so I would prefer not to. Actually, I tried it out when this thread was originally published, but I eventually discarded it, as performance was fine for me without it…

Regarding the training configuration, the trainer is run with these values:

trainer:
  architecture: TransformerBigSharedEmbeddings
  mixed_precision: true
  num_gpus: 4

And the config (which works fine with 2 GPUs):

train:
  average_last_checkpoints: 8
  batch_size: 4096
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 200000
  maximum_features_length: 256
  maximum_labels_length: 256
  mixed_precision: true
  moving_average_decay: 0.9999
  replace_unknown_target: true
  sample_buffer_size: 500000
  save_checkpoints_steps: 1000
  save_summary_steps: 200
  single_pass: false

Double-checking this config, I just noticed that “mixed_precision” is also passed here, but I don’t know why. Anyway, it is passed to the trainer, so it should work nevertheless…

Note: I have also tried the latest TensorFlow patch release (2.4.4), but there was no change.


Update: I had originally written that mixed precision was not being applied, but it was. I corrected this above too.