It seems some little magic happened with the latest version of OpenNMT-tf (v2.17.0): I noticed a ~20% throughput increase during training! Is it because of the switch to Keras mixed precision? Well done!
Yes, this is mainly related to the mixed precision update. We also revised the loss definition to avoid rescaling the gradients later. This improves the performance a bit more.
Thanks for testing! Let me know if you find any issues related to these changes.
Hi @guillaumekln ,
Now I’m facing some OOM issues with a new training. It seems I cannot affect the amount of GPU memory used even when I reduce the `batch_size` and/or the `sample_buffer_size`.
In the previous successful training, I had left `batch_size` at 2000 without defining `sample_buffer_size`. I had noticed that the shuffle buffer used a higher value than before (from 5000000 to ~6300000) and my GPU memory was squeezed on both GPUs, 11004 out of 11016 MiB and 11008 out of 11019 MiB, according to nvidia-smi. Still, training completed with no OOM.
Now I’m training with almost the same corpus in the other language direction, but training fails at the evaluation step every time. I lowered the `batch_size` as far as 1586 and set `sample_buffer_size` to 5000000, but memory usage remains unchanged.
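For reference, both settings discussed here go in the `train` section of the OpenNMT-tf YAML configuration. A sketch with the values from this thread (these are the numbers mentioned above, not recommendations; exact section layout may vary between OpenNMT-tf versions):

```yaml
train:
  # Batch size counted in tokens (assuming batch_type: tokens).
  batch_size: 2000
  batch_type: tokens
  # Shuffle buffer for the training dataset; this buffer lives in
  # CPU RAM, not GPU memory.
  sample_buffer_size: 5000000
```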
In both the previous training and the current one I’m training with mixed precision and Horovod.
I did notice some related code changes on GitHub, though; hopefully they address these issues in addition to auto-tuning?
`sample_buffer_size` does not affect GPU memory, only CPU memory. Also, TensorFlow reserves all available GPU memory by default, which is why you don’t see any change when reducing the batch size.
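To illustrate the point about reserved memory: nvidia-smi will show the GPUs as nearly full regardless of the batch size, because TensorFlow grabs the memory up front. If you want nvidia-smi to reflect actual usage, you can opt into incremental allocation with `tf.config.experimental.set_memory_growth` (a standard TensorFlow option, shown here as a minimal sketch; it must run before any GPU is initialized):

```python
import tensorflow as tf

def enable_memory_growth():
    """Make TensorFlow allocate GPU memory incrementally instead of
    reserving it all up front. Must be called before the GPUs are
    first used, otherwise TensorFlow raises a RuntimeError."""
    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    return gpus

enable_memory_growth()
```

Note this only changes what nvidia-smi reports; it does not prevent a genuine OOM if a batch really is too large.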
Is it possible that you have a very long sentence in the evaluation file?
Indeed, I found a bug in my preprocessing pipeline that created huge lines. Thanks and sorry for the trouble.
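Since the OOM traced back to oversized lines from a preprocessing bug, a quick stdlib-only scan like the following can catch them before training. This is a hypothetical helper, not part of OpenNMT-tf, and the 200-token threshold is an arbitrary assumption:

```python
def find_long_lines(path, max_tokens=200):
    """Return (line_number, token_count) pairs for lines whose
    whitespace-tokenized length exceeds max_tokens.

    Huge lines in training or evaluation files can blow up GPU memory,
    since the longest sentence in a batch sets the tensor size."""
    offenders = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            n = len(line.split())
            if n > max_tokens:
                offenders.append((i, n))
    return offenders
```

Running it over the evaluation file and fixing or filtering the reported lines avoids hitting the OOM at evaluation time.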