I’m testing a server with 2 RTX 3090 GPUs, using a checkpoint from an ongoing training run on another machine, but I’m facing some performance issues. First, the number of source and target tokens processed is a bit lower than the numbers I get on a server with 2 RTX 2080 Ti GPUs. Then, after the next checkpoint is created, each step seems to be repeated, as if it runs twice on each GPU – for example, the log looks like this:
On Ampere GPUs, TensorFlow enables the new NVIDIA computation mode called “TensorFloat-32”. It’s possible this is causing the unexpected performance results.
Are you able to edit the OpenNMT-tf code? Can you try disabling TensorFloat-32?
In the Runner constructor, add the following line to disable TensorFloat-32:
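For reference, a minimal sketch of what that line could look like, assuming the standard TensorFlow 2.4 API for toggling TF32:

```python
import tensorflow as tf

# Sketch: turn off TensorFloat-32 for float32 matmuls/convolutions on Ampere GPUs.
# tf.config.experimental.enable_tensor_float_32_execution is available since TF 2.4.
tf.config.experimental.enable_tensor_float_32_execution(False)
```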
Yes, I’m using the exact same data files, vocabs, and settings. I just continue training from the checkpoint and compare the training performance with the 2080 Ti.
I also tried using only one GPU, just in case there were sync issues, but the throughput is the same…
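For anyone reproducing that test, here is a minimal sketch of how a single GPU can be isolated in TensorFlow (setting CUDA_VISIBLE_DEVICES=0 before launching has the same effect):

```python
import tensorflow as tf

# Sketch: make only the first GPU visible to TensorFlow (call this before any op runs),
# so multi-GPU synchronization is taken out of the picture.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[0], "GPU")
```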
I read somewhere that complete support for Ampere GPUs was actually added in CUDA 11.1, which unfortunately is not compatible with TensorFlow 2.4.1.
Edit: Disabling TF32 makes no difference in performance, so it is probably not used in mixed precision.
Yes, I checked, and the utilization is erratic, ranging from 99% all the way down to 0%. The cards don’t stay at 0% for long, but this is definitely not a normal utilization pattern.
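In case it helps to log utilization over time instead of watching nvidia-smi, here is a small sketch using the pynvml bindings (an assumption on my side, not part of OpenNMT-tf):

```python
import time
from pynvml import (nvmlInit, nvmlShutdown, nvmlDeviceGetCount,
                    nvmlDeviceGetHandleByIndex, nvmlDeviceGetUtilizationRates)

# Sketch: sample GPU utilization once per second to see how often it drops to 0%.
nvmlInit()
handles = [nvmlDeviceGetHandleByIndex(i) for i in range(nvmlDeviceGetCount())]
try:
    for _ in range(60):
        usage = [nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        print("  ".join(f"GPU{i}: {u:3d}%" for i, u in enumerate(usage)))
        time.sleep(1)
finally:
    nvmlShutdown()
```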
I know this post is a year old, but I’m facing the same problem with an RTX 3090. Did you ever figure out a solution to improve the performance?