OpenNMT-tf with Ampere GPUs

Hello,

I’m testing a server with 2 RTX 3090 GPUs, using a checkpoint from an ongoing training on another machine, but I’m facing some performance issues. First, the number of source and target tokens processed per second is a bit lower than what I get on a server with 2 RTX 2080 Ti GPUs. Then, after the next checkpoint is created, each step seems to be repeated, as if it runs twice on each GPU. For example, the log looks like this:

...
INFO:tensorflow:Step = 56800 ; steps/s = 0.64, source words/s = 18922, target words/s = 15357 ; Learning rate = 0.000262 ; Loss = 2.012340
INFO:tensorflow:Step = 56800 ; steps/s = 0.64, source words/s = 18922, target words/s = 15357 ; Learning rate = 0.000262 ; Loss = 2.012340
...

Batch size auto-tunes to ~7500.

Some specs:

Ubuntu 20.04
nvidia-driver: 460.32
CUDA 11
cuDNN 8.0.5
OpenNMT-tf 2.15
TensorFlow 2.4.1

Any clues what could be wrong?

Hi,

On Ampere GPUs, TensorFlow enables the new NVIDIA computation mode called “TensorFloat-32” (TF32). It’s possible this is causing the unexpected performance results.

Are you able to edit OpenNMT-tf code? Can you try disabling TensorFloat-32?

In the Runner constructor, add the following line to disable TensorFloat-32:

tf.config.experimental.enable_tensor_float_32_execution(False)
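For context, here is a rough sketch of where that call could go. The constructor arguments below are only illustrative and may not match the actual opennmt/runner.py in your version:

# Sketch of the change in opennmt/runner.py (illustrative constructor signature)
import tensorflow as tf

class Runner:
    def __init__(self, model, config, auto_config=False, mixed_precision=False):
        # Run matrix multiplications and convolutions in full float32 on Ampere
        # GPUs instead of the reduced-precision TF32 mode.
        tf.config.experimental.enable_tensor_float_32_execution(False)
        self._model = model
        self._config = config
        # ... rest of the constructor unchanged ...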

The logging issue is probably unrelated. Are you using onmt-main?

Thanks for the input, @guillaumekln.

The thing is that I train with mixed precision. Is TF32 still used in that case?

Yes, I’m using onmt-main, here is the full training command:

CUDA_VISIBLE_DEVICES=0,1 onmt-main --model config/CustomTF.py --auto_config --config config/train.yml --mixed_precision train --with_eval --num_gpus 2 &>> train.log &

In that case TF32 is probably not used. At least that’s my expectation; I have not tried Ampere GPUs yet.

Are you comparing the same model on the 3090 and 2080 Ti? In particular, is it the same vocabulary size?

Yes, I’m using the exact same data files, vocabs, and settings. I just continue training from a checkpoint and compare the training performance with the 2080 Ti server.
I also tried using only one GPU, in case there are synchronization issues, but the throughput is the same…

I read somewhere that complete support for Ampere GPUs was actually added in CUDA 11.1, which unfortunately is not compatible with TensorFlow 2.4.1.

Edit: Disabling TF32 makes no difference in performance, so it is probably not used with mixed precision.
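For anyone who wants to double-check this, a quick check like the following, added at the same place in the code, should confirm both settings (assuming the TF 2.4 API):

import tensorflow as tf

# Prints whether TF32 is currently enabled and which Keras precision policy is active.
print("TF32 enabled:", tf.config.experimental.tensor_float_32_execution_enabled())
print("Mixed precision policy:", tf.keras.mixed_precision.global_policy().name)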

Did you check the GPU usage with nvidia-smi? Does it look like the GPUs are fully used?

Yes, I checked and the utilization is erratic, ranging from 99% all the way down to 0%. The cards don’t stay at 0% for long, but this is definitely not a normal utilization pattern.

It seems the TensorFlow Docker images built by NVIDIA use CUDA 11.1 (see the NVIDIA NGC catalog).

You could try installing and running OpenNMT-tf within the following Docker image:

nvcr.io/nvidia/tensorflow:20.12-tf2-py3

Unfortunately, I can’t test the Docker image right now, as I have exchanged the server for another one with a bunch of 2080 Tis.

I think that starting from CUDA 11.2, this terrible situation with incompatibilities between minor CUDA versions will stop.

Hello panosk,

I know this post is a year old, but I’m facing the same problem with an RTX 3090. I was wondering if you have figured out a solution to improve the performance.

Best regards,
Samuel

Hello,

I haven’t used an RTX 3090 since then, but maybe your issue is related to this post: Cannot scale well with multiple GPUs

Thanks for the info,

It turned out that I was not using the card efficiently. I’m fine-tuning my settings to use it at 100% capacity, whereas before I was at 30%!

Best regards
Samuel