OpenNMT-tf with Ampere GPUs

panosk · February 17, 2021, 8:25am

Hello,

I’m testing a server with 2 RTX 3090 GPUs using a checkpoint from an ongoing training run on another machine, but I’m facing some performance issues. First, the number of source and target tokens processed per second is a bit lower than what I get on a server with 2 RTX 2080 Ti GPUs. Then, after the next checkpoint is created, each step seems to be repeated, as if it runs twice on each GPU. For example, the log looks like this:

...
INFO:tensorflow:Step = 56800 ; steps/s = 0.64, source words/s = 18922, target words/s = 15357 ; Learning rate = 0.000262 ; Loss = 2.012340
INFO:tensorflow:Step = 56800 ; steps/s = 0.64, source words/s = 18922, target words/s = 15357 ; Learning rate = 0.000262 ; Loss = 2.012340
...
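
A quick way to check whether steps really are reported twice, and to compare throughput between machines, is to parse the log. The sketch below assumes the line format quoted above; summarize_log is a hypothetical helper written for this example, not an OpenNMT-tf function.

```python
import re
from collections import Counter

# Matches the "Step = ... ; steps/s = ..., source words/s = ..." lines quoted above.
LOG_RE = re.compile(
    r"Step = (?P<step>\d+) ; steps/s = (?P<sps>[\d.]+), "
    r"source words/s = (?P<src>\d+), target words/s = (?P<tgt>\d+)"
)


def summarize_log(lines):
    """Return (repeated_steps, mean_source_wps, mean_target_wps).

    repeated_steps lists every step number that appears more than once.
    The means count each step once, so duplicated lines do not skew them.
    """
    counts = Counter()
    throughput = {}
    for line in lines:
        match = LOG_RE.search(line)
        if match is None:
            continue  # not a training progress line
        step = int(match.group("step"))
        counts[step] += 1
        throughput[step] = (int(match.group("src")), int(match.group("tgt")))

    repeated = sorted(step for step, n in counts.items() if n > 1)
    if not throughput:
        return repeated, 0.0, 0.0
    total = len(throughput)
    mean_src = sum(v[0] for v in throughput.values()) / total
    mean_tgt = sum(v[1] for v in throughput.values()) / total
    return repeated, mean_src, mean_tgt
```

To run it on a log file, pass the open file object, for example summarize_log(open("train.log")).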

Batch size auto-tunes to ~7500.

Some specs:

Ubuntu 20.04
nvidia-driver: 460.32
CUDA 11
cuDNN 8.0.5
OpenNMT-tf 2.15
TensorFlow 2.4.1

Any clues what could be wrong?

guillaumekln · February 17, 2021, 10:00am

Hi,

On Ampere GPUs, TensorFlow enables by default the new NVIDIA computation mode called “TensorFloat-32” (TF32). It’s possible this is causing unexpected performance results.

Are you able to edit OpenNMT-tf code? Can you try disabling TensorFloat-32?

In the Runner constructor, add the following line to disable TensorFloat-32:

tf.config.experimental.enable_tensor_float_32_execution(False)
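
To confirm that the setting took effect, a minimal sketch follows. It assumes TensorFlow 2.4 or later; the helper name set_tf32 is made up for this example and is not part of OpenNMT-tf. It toggles TF32 and returns the state TensorFlow reports back.

```python
def set_tf32(enabled):
    """Toggle TensorFloat-32 execution and return the state TensorFlow reports.

    Assumes TensorFlow >= 2.4, where both calls below are available.
    The import is kept inside the function so the module loads without TensorFlow.
    """
    import tensorflow as tf

    tf.config.experimental.enable_tensor_float_32_execution(enabled)
    return tf.config.experimental.tensor_float_32_execution_enabled()
```

Calling set_tf32(False) before the model is built should return False.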

The logging issue is probably unrelated. Are you using onmt-main?
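
As background on repeated log lines: one common way for a Python program to print every record twice is a logger that has its own handler and also propagates to a parent that has a handler. Whether that is what happens here is not established in this thread. The sketch below only demonstrates the mechanism with the standard logging module, using made-up logger names.

```python
import io
import logging


def count_emitted_lines(propagate):
    """Log one record on a child logger and count how many lines get written.

    Both the parent and the child have a handler writing to the same stream.
    The logger names "demo" and "demo.child" are illustrative only.
    """
    stream = io.StringIO()
    parent = logging.getLogger("demo")
    child = logging.getLogger("demo.child")
    for logger in (parent, child):
        logger.handlers.clear()  # start from a clean state on every call
        logger.setLevel(logging.INFO)
        logger.addHandler(logging.StreamHandler(stream))
    parent.propagate = False  # keep the demo away from the real root logger
    child.propagate = propagate

    child.info("Step = 56800")
    return len(stream.getvalue().splitlines())
```

With propagation on, the single record is written by both handlers; with propagation off, only the child's handler writes it.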

panosk · February 17, 2021, 10:06am

Thanks for the input, @guillaumekln.

The thing is that I train with mixed precision. Is TF32 still used then?

Yes, I’m using onmt-main, here is the full training command:

CUDA_VISIBLE_DEVICES=0,1 onmt-main --model config/CustomTF.py --auto_config --config config/train.yml --mixed_precision train --with_eval --num_gpus 2 &>> train.log &

guillaumekln · February 17, 2021, 10:17am

In that case, TF32 is probably not used. At least that is my expectation; I have not tried Ampere GPUs yet.

Are you comparing the same model on the 3090 and 2080 Ti? In particular, is it the same vocabulary size?

panosk · February 17, 2021, 10:24am

Yes, I’m using the exact same data files, vocabularies, and settings. I just continue training from a checkpoint and compare the training performance with the 2080 Ti.
I also tried to use only one GPU, just in case there are sync issues, but throughput is the same…

I read somewhere that complete support for Ampere GPUs was actually added in CUDA 11.1, which unfortunately is not compatible with TensorFlow 2.4.1.

Edit: Disabling TF32 makes no difference in performance, so it is probably not used with mixed precision.

guillaumekln · February 17, 2021, 11:53am

Did you check the GPU usage with nvidia-smi? Does it look like the GPUs are fully used?

panosk · February 17, 2021, 12:08pm

Yes, I checked, and the utilization is erratic, ranging from 99% all the way down to 0%. The cards don’t stay at 0% for long, but this is definitely not a normal utilization pattern.
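
To put numbers on a pattern like this, utilization can be sampled over time rather than watched by eye. The sketch below polls nvidia-smi with its --query-gpu option and reports the minimum, mean, and maximum per GPU. The function names, the default of 30 samples, and the one-second interval are choices made for this example.

```python
import subprocess
import time

# One CSV row per GPU, e.g. "0, 99", with no header and no "%" unit.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(text):
    """Parse "index, utilization" CSV rows into {gpu_index: percent}."""
    result = {}
    for row in text.strip().splitlines():
        index, util = (field.strip() for field in row.split(","))
        result[int(index)] = int(util)
    return result


def sample_utilization(samples=30, interval=1.0):
    """Poll nvidia-smi and return {gpu_index: [percent, ...]}."""
    history = {}
    for _ in range(samples):
        out = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
        for index, util in parse_utilization(out).items():
            history.setdefault(index, []).append(util)
        time.sleep(interval)
    return history


def summarize(history):
    """Return {gpu_index: (min, mean, max)} for the sampled percentages."""
    return {
        index: (min(values), sum(values) / len(values), max(values))
        for index, values in history.items()
    }
```

Running summarize(sample_utilization()) while training is in progress gives a per-GPU summary; a low mean with a high maximum would match the erratic pattern described above.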

guillaumekln · February 17, 2021, 2:35pm

It seems the TensorFlow Docker images built by NVIDIA are using CUDA 11.1: NVIDIA NGC

You could try installing and running OpenNMT-tf within the following Docker image:

nvcr.io/nvidia/tensorflow:20.12-tf2-py3

panosk · February 17, 2021, 6:08pm

Unfortunately, I can’t test the Docker image now, as I have exchanged the server for another one with a bunch of 2080 Tis.

I think that starting from CUDA 11.2, this terrible situation with incompatibilities between minor CUDA versions will stop.

SamuelLacombe · April 13, 2022, 3:11am

Hello panosk,

I know this post is a year old, but I’m facing the same problem with an RTX 3090. I was wondering if you figured out a solution to improve the performance?

Best regards,
Samuel

panosk · April 13, 2022, 8:03am

Hello,

I haven’t used an RTX 3090 since then, but maybe your issue is related to this post: Cannot scale well with multiple GPUs

SamuelLacombe · April 14, 2022, 1:46am

Thanks for the info,

It turned out that I was not using the card efficiently. I’m now fine-tuning my settings to use it at 100% capacity, whereas before I was at 30%!

Best regards
Samuel