I’m trying to use 8 RTX 2080 Ti GPUs but I don’t see significant performance improvements compared to 2 GPUs.
The problems start early on, when I set batch_size to 0 so it gets auto-tuned. Auto-tuning takes far too long: it completes a first attempt with a batch size of 8704, then a second with 4863, and then it simply gets stuck on the next batch size and I have to kill the process.
So I set the batch size myself and training starts, but performance is poor. With 2 GPUs I get ~20,000 source tokens/sec and ~16,000 target tokens/sec; with 8 GPUs I only gain about 2,000 tokens/sec.
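Just to quantify the gap, here is my own back-of-envelope calculation using the source-token numbers above (the variable names are mine):

```python
# Scaling efficiency estimate from the throughput numbers above.
# 2 GPUs: ~20000 source tokens/sec; 8 GPUs: only ~2000 tokens/sec more.
two_gpu = 20000
eight_gpu = two_gpu + 2000

# Ideal linear scaling from 2 -> 8 GPUs would be a 4x throughput increase.
ideal = two_gpu * (8 / 2)
efficiency = eight_gpu / ideal
print(f"{efficiency:.0%}")  # prints 28%
```

So I'm only seeing roughly a quarter of the ideal 8-GPU throughput.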
I also get this warning and a traceback when using all 8 GPUs:
WARNING:tensorflow:Large unrolled loop detected. Did you mean to use a TF loop? The following ops were created after iteration 3002: (<tf.Operation 'VarIsInitializedOp_3000/resource' type=Placeholder>, <tf.Operation 'VarIsInitializedOp_3000' type=VarIsInitializedOp>, <tf.Operation 'LogicalAnd_3000' type=LogicalAnd>)
GPU utilization is moderate, around 60%. I’ve tried several combinations of software versions, but the results are the same:
tensorflow 2.3.2 + CUDA 10.1 + cuDNN 7.6.5
tensorflow 2.4.1 + CUDA 11.0 + cuDNN 8.0.4
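For completeness, this is the kind of sanity check I run to confirm which CUDA/cuDNN versions the TensorFlow build expects and how many GPUs it actually sees (a diagnostic sketch of mine, not part of the training code):

```python
# Print the CUDA/cuDNN versions this TensorFlow wheel was built against
# and the number of GPUs TensorFlow can enumerate.
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("TF:", tf.__version__)
print("CUDA:", build.get("cuda_version"))
print("cuDNN:", build.get("cudnn_version"))
print("GPUs visible:", len(tf.config.list_physical_devices("GPU")))
```

In my case all 8 devices are listed, so TensorFlow does see the hardware.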