Cannot scale well with multiple GPUs

I tried your suggestion but I see no improvement. Although at the first step there is a difference and things look promising, after that performance decreases again:

INFO:tensorflow:Step = 105100 ; steps/s = 1.73, source words/s = 57549, target words/s = 46226 ; Learning rate = 0.000193 ; Loss = 2.362551
INFO:tensorflow:Step = 105200 ; steps/s = 0.84, source words/s = 27941, target words/s = 21942 ; Learning rate = 0.000193 ; Loss = 2.492122
INFO:tensorflow:Step = 105300 ; steps/s = 0.46, source words/s = 15349, target words/s = 11928 ; Learning rate = 0.000193 ; Loss = 2.420474
INFO:tensorflow:Step = 105400 ; steps/s = 0.46, source words/s = 15374, target words/s = 11964 ; Learning rate = 0.000193 ; Loss = 2.493598
INFO:tensorflow:Step = 105500 ; steps/s = 0.46, source words/s = 15275, target words/s = 11868 ; Learning rate = 0.000192 ; Loss = 2.374078

Could this be a CPU or I/O bottleneck? We’ve seen this when applying too many operations on the fly with a CPU that is underpowered relative to the number of GPUs, for instance. (That was with -py though, not sure about -tf.)
May not be applicable here but thought it was worth mentioning.

@francoishernandez It’s possible this also plays a role, but I don’t think it is the main issue here because @panosk is able to get good performance when the multi-GPU training is managed by Horovod.

When enabling on-the-fly tokenization, the execution needs to move back and forth between the TensorFlow runtime and the Python runtime where the tokenization is applied. But TensorFlow multi-GPU training uses multithreading, and Python code cannot be parallelized with threads because of the Global Interpreter Lock. So I think all threads are stumbling over each other when running this Python tokenization and it becomes a bottleneck. Horovod, on the other hand, starts a separate process for each GPU.
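To make the issue concrete, here is a minimal sketch (not the actual OpenNMT-tf code) of a Python tokenizer injected into a tf.data pipeline through tf.py_function; the tokenize_py helper and the file name are made up for illustration:

import tensorflow as tf

def tokenize_py(line):
    # Placeholder Python tokenization; OpenNMT-tf would call the pyonmttok tokenizer here.
    return line.numpy().decode("utf-8").split()

dataset = tf.data.TextLineDataset("train.src")
dataset = dataset.map(
    # Every call re-enters the Python interpreter and must hold the GIL,
    # so the per-GPU input threads end up serializing on this step.
    lambda line: tf.py_function(tokenize_py, [line], tf.string),
    num_parallel_calls=tf.data.AUTOTUNE,
)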

So the current possible workarounds are:

  • use Horovod
  • tokenize the data before the training and remove the tokenization from the configuration (see the sketch below)
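
For the second workaround, here is a hedged sketch of pre-tokenizing the corpus once with pyonmttok; the options and file names are only illustrative (reuse the settings from your configuration and check the Tokenizer API of your installed pyonmttok version):

import pyonmttok

# Illustrative options; reuse the tokenization settings from your training configuration.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Tokenize the raw files once; the training then reads pre-tokenized text
# directly and no Python code runs inside the TensorFlow input pipeline.
tokenizer.tokenize_file("train.raw.en", "train.tok.en")
tokenizer.tokenize_file("train.raw.de", "train.tok.de")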

I’ll open an issue in the OpenNMT-tf repository since there is clearly something to improve here.


Hi @francoishernandez ,

This could be a factor, but I think not in my case: the machines I’m using are pretty beefy, with Xeon processors, NVMe storage, and plenty of RAM.

@guillaumekln , I’ll keep using Horovod, but it’s nice we found the reason, thanks!

For reference, the performance issue related to the on-the-fly OpenNMT tokenization is fixed (or largely improved) with this change:


Hi everyone,

Not sure if this thread would be the most appropriate for this post, but I’m experiencing a very similar issue, just without online tokenisation.

We have recently upgraded to new machines with 4 Quadro RTX 6000 GPUs instead of 2, but the performance barely increased by 2,000 tokens/sec with the same data and the same model configuration (just with num_gpus=4). More concretely, with a batch size of 4096 tokens, I get around 29k tokens/sec with 2 GPUs, while with 4 GPUs I get 31k tokens/sec.

Like @panosk, I’m also training a TransformerBig model with shared embeddings.

It seems GPUs are well installed and recognised:

Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
     0 1 2 3 
0:   N Y Y Y 
1:   Y N Y Y 
2:   Y Y N Y 
3:   Y Y Y N 

However, the GPUs seem to run at only about half capacity. Looking at the power consumption, I can see this (it’s the same for all GPUs): 132W / 260W

| 59%   77C    P2   132W / 260W |  23410MiB / 24220MiB |     66%      Default |

while in a 2-GPU machine, with the same data and configuration, values are around 231W / 260W and the volatile GPU utilization is also higher (around 90-95%):

| 53%   74C    P2   231W / 260W |  23450MiB / 24220MiB |     93%      Default | 

The 4-GPU machines have the same specifications as the 2-GPU ones. Versions are the following:

  • OpenNMT-tf 2.22.0
  • TensorFlow 2.4.3 + CUDA 11.0

Do you know anything I could try to pinpoint the cause of this issue?

Hi,

There are multiple things you can try to see how they change your performance numbers:

  • Increase the batch size
  • Enable mixed precision
  • Run multi GPU training with Horovod

For completeness, can you also post your training configuration?

Thanks a lot for the hints, Guillaume.

Regarding the batch size, I usually use 4096 tokens (as with my older machines). If I increase it, I get OOM errors. If I decrease it, performance is impacted accordingly, so nothing out of the ordinary here, I think.

Interestingly, I get an error if I try to autotune the batch size. I thought this was related to my machine/configuration, so I didn’t create a GitHub issue, but let me know if it would be useful. The error is as follows:

INFO:tensorflow:... failed.
INFO:tensorflow:Trying training with batch size 4287...
INFO:tensorflow:... failed.

[...]

ERROR:tensorflow:Last training attempt exited with an error:

Traceback (most recent call last):
load_model_module

[...]

    raise ValueError("Model configuration not found in %s" % path)
ValueError: Model configuration not found in [local_path]/run/model_description.py

Regarding mixed precision, I always have it enabled. When you shared your last benchmarks, I was able to reproduce the figures with and without mixed precision on my older machines (2 GPUs). On my new machines, mixed precision seems to work fine too. If I disable it, performance drops significantly (by roughly 40%). The related check seems OK:

INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0

Regarding Horovod, my implementation/environment makes it a bit difficult to use, so I would prefer not to. Actually, I tried it out when this post was originally published, but I eventually discarded it, as performance was fine for me without it…

Regarding the training configuration, the trainer is run with these values:

trainer:
  architecture: TransformerBigSharedEmbeddings
  mixed_precision: true
  num_gpus: 4
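
(For reference, these trainer values presumably map onto the OpenNMT-tf Python API roughly as in this sketch; the opennmt.Runner usage and the config file name are only illustrative, so check them against your version.)

import yaml
import opennmt

# "config.yml" stands for the training configuration shown below.
with open("config.yml") as f:
    config = yaml.safe_load(f)

model = opennmt.models.TransformerBigSharedEmbeddings()
runner = opennmt.Runner(model, config, auto_config=True, mixed_precision=True)
runner.train(num_devices=4)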

And the config (which works fine with 2 GPUs):

train:
  average_last_checkpoints: 8
  batch_size: 4096
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 200000
  maximum_features_length: 256
  maximum_labels_length: 256
  mixed_precision: true
  moving_average_decay: 0.9999
  replace_unknown_target: true
  sample_buffer_size: 500000
  save_checkpoints_steps: 1000
  save_summary_steps: 200
  single_pass: false

Double-checking this config, I just noticed that “mixed_precision” is also passed here, but I don’t know why. Anyway, it is passed to the trainer, so it should work nevertheless…

Note: I have also tried the latest TensorFlow patch (2.4.4), but nothing changed.


Update: I had originally written that mixed precision was not being applied, but it was. I corrected this above too.

I checked again with OpenNMT-tf 2.22.0 + TensorFlow 2.4.3 and I can still reproduce the 4 GPU + FP16 number from the benchmark I did earlier (without tokenization):

I also tried with your exact configuration and the performance is similar (slightly lower due to moving average).

If it’s not already the case, I suggest trying the TensorFlow Docker image tensorflow/tensorflow:2.4.3-gpu to make sure this is not a library issue.

If the performance still does not improve, there may be a bottleneck in the system. For example, a PCIe bottleneck is a real possibility, though I’m not sure whether it is relevant here.

Thanks a lot for reproducing the benchmark figures with those versions, Guillaume. As always, your help is very much appreciated.

Even though the issue is not solved yet on my side, let me give an update and ask a couple of questions.

After some detailed profiling, we could not find anything that explains the poor performance. Hardware specifications should be fine, especially in terms of CPUs and RAM. We will run more tests, but I suspect we won’t find anything there.

I have also run tests with TensorFlow 2.5.2 and CUDA 11.2, which is another tested build configuration according to TensorFlow, with identical results.

So, in the end, the most promising option now is to use Horovod, which is what we will try next.

In the meantime, I would like to ask something. To improve the performance, I tried different values for the num_shards and num_threads parameters of opennmt.data.training_pipeline. The way I did it was to change the dataset_fn definition in runner.py so that I could pass my own values, like so:

dataset_fn = (
    lambda input_context: model.examples_inputter.make_training_dataset(
        [...]
        num_shards=train_config["num_shards"],
        num_threads=train_config["num_threads"],
        [...]
    )
)

However, the tokens/sec performance remains exactly the same regardless of the values I passed in for num_shards and num_threads. Could the reason be the GIL, as you pointed out above, @guillaumekln? Or should I override these parameters in a different way?

The other thing I tried was to increase sample_buffer_size to load the whole dataset into memory, hoping to help with a potential I/O bottleneck. Unfortunately, this also had no effect at all (it just takes some additional seconds to load all the data at the beginning, but that’s it).

Also, I would be very curious to know if anyone is getting the performance indicated by Guillaume’s benchmarks with 4 or more GPUs with 24 GB of VRAM each on a local server (not AWS instances).

The way you set these parameters should work, but I don’t think they make a difference since your data is already tokenized. In that case, there is not much work to do when preparing a batch, so it is unlikely to be the bottleneck here.

What’s your vocabulary size?

In this case, 48k.

INFO:tensorflow:Initialized source input layer:
INFO:tensorflow: - vocabulary size: 47952
INFO:tensorflow: - special tokens: BOS=no, EOS=no
INFO:tensorflow:Initialized target input layer:
INFO:tensorflow: - vocabulary size: 47952
INFO:tensorflow: - special tokens: BOS=yes, EOS=yes

Hi @dmarin, did you find out the reason for your performance issues?

Hi Guillaume,

Unfortunately not. We tried many things, and the conclusion was that if we want to use more than 2 GPUs for training, we have to use Horovod. But even with Horovod, we don’t get the performance you get in your benchmarks (and that I also get with AWS instances with 4 GPUs).

Otherwise, running two trainings in parallel with two GPUs each does work well and allows us to make the most out of the machines.

I saw some (maybe) related issues on GitHub, so don’t hesitate to ask if you would like me to test anything.


Hi,

I tried to train a Transformer for a few iterations using TensorFlow 2.9.0 and OpenNMT-tf 2.27.1 and got the results in the table below (GPUs are V100s). I am reporting source + target words/s.

         FP32    Speedup   FP16    Speedup
1 GPU    39.0k   -         74.0k   -
2 GPUs   69.5k   1.78      111k    1.50
4 GPUs   96.0k   2.46      100k    1.35
8 GPUs   78.0k   2.00      81.5k   1.10

As you can see, the speedup is as expected up to 2 GPUs, but the gains from additional GPUs are marginal at best.

One warning I get in the logs is:

W tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 10.1.243, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.

The command cat /usr/local/cuda/version.txt returns CUDA Version 10.1.243.

I downgraded to TensorFlow 2.3.0 and OpenNMT-tf 2.15.0 and re-ran the test for the 8-GPU case and got:

         FP32   Speedup   FP16     Speedup
8 GPUs   241k   6.2       261.5k   3.5

This seems to confirm that scaling now works as expected for FP32 but not for mixed precision (FP16). Curiously, I now get the same warning as @panosk:

WARNING:tensorflow:Large unrolled loop detected. Did you mean to use a TF loop? The following ops were created after iteration 3002: (<tf.Operation 'VarIsInitializedOp_3000/resource' type=Placeholder>, <tf.Operation 'VarIsInitializedOp_3000' type=VarIsInitializedOp>, <tf.Operation 'LogicalAnd_3000' type=LogicalAnd>)
See https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/autograph/g3doc/reference/common_errors.md#warning-large-unrolled-loop-detected
[...]

My only explanation for this is that CUDA 10.1.* is compatible with TensorFlow 2.3.0 but not with TensorFlow 2.9.0 (as shown here), and that having compatible CUDA and TensorFlow versions could solve the problem (at least for the FP32 case).

I am still puzzled by the poor FP16 performance though. Any hint on that side?

Hi,

TensorFlow 2.9 requires CUDA 11.2. I’m surprised you managed to run the training at all with CUDA 10.1. I suggest re-launching the first experiment with an updated CUDA version.

You may also need to increase the work for each GPU. Try training a TransformerBig model and/or increase the batch size.

I’ve updated to CUDA 11.2 and re-tested with OpenNMT-tf 2.27.1 and TensorFlow 2.9.0 and got:

         FP32     Speedup   FP16     Speedup
8 GPUs   104.5k   2.7       124.5k   1.7

which is somewhat of an improvement but not what one would expect. Also, it is twice as slow as OpenNMT-tf 2.15.0 with TensorFlow 2.3.0 and CUDA 10.2.

I also tried with a TransformerBig architecture and the results are:

         FP32    Speedup   FP16    Speedup
8 GPUs   41.5k   n.a.      54.5k   n.a.

which seems to confirm that switching to FP16 does not bring any significant speed up.

Are you applying the tokenization on the fly?

When training on pre-tokenized data, the numbers reported in an earlier post are still valid. I just retested a TransformerBig training on 8 V100s with TensorFlow 2.9.1 and OpenNMT-tf 2.27.1. I’m reporting source + target tokens/s:

         FP32     FP16
8 GPUs   105.3k   219.5k

Note that I ran the training on an AWS instance using the official TensorFlow Docker image. In my experience this ensures a good hardware and software configuration.

I’m not using tokenization on the fly. I tokenized all sentences before starting the training.

Using the official TensorFlow Docker image seems to be solving the problem for me. Here are my results for a TransformerBig with a batch size of 4096 and no gradient accumulation:

         FP32    FP16
4 GPUs   52.4k   97.0k

Thanks for helping out @guillaumekln !

P.S. I am using TensorFlow 2.9.1, OpenNMT-tf 2.27.1, CUDA 11.2, and V100 GPUs on GCP.


To everyone stuck with the same problem: I’d recommend using Horovod instead of TensorFlow’s built-in distribution. It solved the problem for me.