Thanks a lot for the hints, Guillaume.
Regarding the batch size, I usually use 4096 tokens (the same value as on my older machines). If I increase it, I get OOM errors; if I decrease it, throughput drops accordingly, so nothing out of the ordinary here, I think.
Interestingly, I get an error if I try to autotune the batch size. I thought this was related to my machine/configuration, so I didn’t create a GitHub issue, but let me know if it would be useful. The error is as follows:
INFO:tensorflow:... failed.
INFO:tensorflow:Trying training with batch size 4287...
INFO:tensorflow:... failed.
[...]
ERROR:tensorflow:Last training attempt exited with an error:
Traceback (most recent call last):
load_model_module
[...]
raise ValueError("Model configuration not found in %s" % path)
ValueError: Model configuration not found in [local_path]/run/model_description.py
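For context, my reading of the failure (an assumption on my side, not taken from the OpenNMT-tf source) is that each autotuning attempt restarts training and tries to reload the model definition from model_description.py in the run directory, which apparently does not exist in my case. A minimal sketch of the kind of check that would raise this error (the message is copied from the traceback; the rest is hypothetical):

```python
import os

def load_model_module(path):
    # Hypothetical sketch of the loader that fails above: it expects a model
    # definition at <run_dir>/model_description.py and raises if it is missing.
    # Only the error message is taken from the traceback; everything else is a guess.
    if not os.path.exists(path):
        raise ValueError("Model configuration not found in %s" % path)
    # The real loader would import the module and return the model definition here.
```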
Regarding mixed precision, I always have it enabled. When you shared your last benchmarks, I was able to reproduce the figures with and without mixed precision on my older machines (2 GPUs). On my new machines, mixed precision seems to work fine too. If I disable it, performance drops significantly (by roughly 40%). The related check seems OK:
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
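In case it is useful for comparison, this is how I cross-check the policy outside the trainer, using the standard TF 2.4 Keras mixed precision API (my assumption is that the trainer relies on the same global policy):

```python
import tensorflow as tf

# Standalone check of the mixed_float16 policy, independent of the trainer.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
policy = tf.keras.mixed_precision.global_policy()
print(policy.name, policy.compute_dtype, policy.variable_dtype)
# Expected: mixed_float16 float16 float32
```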
Regarding Horovod, my implementation/environment makes it a bit difficult to use, so I would prefer not to. I actually tried it out when this post was originally published, but I eventually discarded it, as performance was fine for me without it…
Regarding the training configuration, the trainer is run with these values:
trainer:
  architecture: TransformerBigSharedEmbeddings
  mixed_precision: true
  num_gpus: 4
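As a side note on num_gpus: my understanding (an assumption on my side, I have not checked the source) is that multi-GPU training without Horovod maps to a MirroredStrategy over the visible devices, which can be sanity-checked on its own:

```python
import tensorflow as tf

# Assumption: num_gpus: 4 corresponds to a MirroredStrategy over the 4 visible GPUs.
devices = ["/gpu:%d" % i for i in range(4)]
strategy = tf.distribute.MirroredStrategy(devices=devices)
print("Replicas in sync:", strategy.num_replicas_in_sync)  # expect 4 on the new machines
```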
And the config (which works fine with 2 GPUs):
train:
  average_last_checkpoints: 8
  batch_size: 4096
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 200000
  maximum_features_length: 256
  maximum_labels_length: 256
  mixed_precision: true
  moving_average_decay: 0.9999
  replace_unknown_target: true
  sample_buffer_size: 500000
  save_checkpoints_steps: 1000
  save_summary_steps: 200
  single_pass: false
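If it helps, this is how I read the interaction between batch_size, num_gpus, and effective_batch_size (an assumption about the gradient accumulation logic, not taken from the source): gradients are accumulated until the effective batch size is reached, so the number of accumulation steps differs between the 2-GPU and 4-GPU setups:

```python
import math

# Assumption: accum_steps is the smallest integer such that
# batch_size * num_gpus * accum_steps >= effective_batch_size.
batch_size = 4096             # tokens per replica and per step
effective_batch_size = 25000  # tokens per optimizer update

for num_gpus in (2, 4):
    accum_steps = math.ceil(effective_batch_size / (batch_size * num_gpus))
    print(num_gpus, "GPUs ->", accum_steps, "accumulation steps")
# 2 GPUs -> 4 accumulation steps
# 4 GPUs -> 2 accumulation steps
```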
Double-checking this config, I just noticed that “mixed_precision” is also set here, although I don’t know why. Anyway, it is passed to the trainer as well, so it should still take effect…
Note: I have also tried the latest TensorFlow patch release (2.4.4), but it made no difference.
Update: I had originally written that mixed precision was not being applied, but it was. I corrected this above too.