OpenNMT

How to configure TensorFlow config and optimize memory usage

Hi, I’m trying to improve my project by training on a larger vocabulary (100k) with the TransformerBigRelative model, but I get an OOM error even on a 16GB P100.

The error message suggests adding TF_GPU_ALLOCATOR=cuda_malloc_async to the environment variables, so I tried that in Colab as:

import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'
print(os.getenv('TF_GPU_ALLOCATOR'))

I also tried:

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

But it still seems to throw the same error message.

I see the batch size is configured as 3072 tokens. I guess that assumes multi-GPU training? I will try reducing it, but maybe it would be better to use a sample-based batch size, for example 32 samples, so that a batch doesn’t cut off in the middle of a sentence?

Hi,

You should probably reduce the batch size.

You can try the automatic batch size selection by setting batch_size: 0 in the configuration. The training will try to find a value for you.
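
For reference, a minimal sketch of where that option might sit in the training configuration. It builds the fragment as a Python dictionary and prints it as YAML. The `train` section and `batch_type: tokens` are assumptions about a typical OpenNMT-tf configuration that were not stated above, and `to_yaml_lines` is only a hypothetical helper for this illustration.

```python
def to_yaml_lines(mapping, indent=0):
    """Render a nested dict of scalars as simple YAML lines (illustration only)."""
    lines = []
    pad = " " * indent
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, indent + 2))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


# Assumed layout: batch options live under the "train" section.
config = {
    "train": {
        "batch_size": 0,          # 0 asks the training to search for a value
        "batch_type": "tokens",   # assumption: token-based batches
    }
}

yaml_text = "\n".join(to_yaml_lines(config)) + "\n"
print(yaml_text)
```

The printed fragment could then be merged into the existing configuration file by hand.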

The configured batch size applies to a single GPU. In multi-GPU training, each GPU gets a batch size of 3072 tokens with this configuration.

Sentences are never cut off when preparing batches. Token-based batch size is usually better for the hardware since the total size of a batch (num. samples * length) is constant.
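
To illustrate the point about constant batch size, here is a simplified sketch comparing sample-based and token-based batching on a handful of made-up sentence lengths. It is not OpenNMT-tf's actual batching code. The functions `batch_by_samples`, `batch_by_tokens` and `padded_size` are hypothetical, and the token budget of 100 is chosen only to keep the numbers small.

```python
def batch_by_samples(lengths, num_samples):
    """Group sentences into batches with a fixed number of samples."""
    return [lengths[i:i + num_samples] for i in range(0, len(lengths), num_samples)]


def batch_by_tokens(lengths, max_tokens):
    """Group whole sentences so that (num. samples * longest length) stays within budget."""
    batches, current = [], []
    for length in lengths:
        candidate = current + [length]
        if current and len(candidate) * max(candidate) > max_tokens:
            batches.append(current)   # close the batch; the sentence is kept whole
            current = [length]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


def padded_size(batch):
    """Size of the padded batch: number of samples times the longest sentence."""
    return len(batch) * max(batch)


# Made-up sentence lengths, sorted so similar lengths end up together.
lengths = sorted([5, 50, 8, 40, 10, 4, 25, 6, 20, 50])

sample_batches = batch_by_samples(lengths, num_samples=2)
token_batches = batch_by_tokens(lengths, max_tokens=100)

print("sample-based padded sizes:", [padded_size(b) for b in sample_batches])
print("token-based padded sizes: ", [padded_size(b) for b in token_batches])
```

With a fixed number of samples the padded size grows with sentence length, so the batch size has to be chosen for the longest sentences. With a token budget the padded size stays bounded, and every sentence still appears whole in exactly one batch.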

Thanks for the reply. The automatic batch size selection seems to settle at around 1500.

I also just noticed that the TF_FORCE_GPU_ALLOW_GROWTH environment variable takes precedence over the value set in the TensorFlow config. The log says: “Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0”
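
In case it helps others, a minimal sketch of forcing memory growth through the environment. It assumes the variable has to be in place before TensorFlow initializes the GPU, so it is set before the import. The helper `enable_memory_growth` is hypothetical and simply returns an empty list when TensorFlow is not installed.

```python
import importlib.util
import os

# Set the variable first, so it is visible when TensorFlow sets up the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


def enable_memory_growth():
    """Also request memory growth per GPU through the Python API (hypothetical helper)."""
    if importlib.util.find_spec("tensorflow") is None:
        return []  # TensorFlow is not installed in this environment
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before the GPUs are initialized, otherwise TensorFlow raises an error.
        tf.config.experimental.set_memory_growth(gpu, True)
    return gpus


if __name__ == "__main__":
    print("GPUs configured:", enable_memory_growth())
```

In Colab this may require restarting the runtime first if TensorFlow has already been used in the session. Note that memory growth only changes how memory is reserved. It does not reduce what the model actually needs, so it would not be expected to fix a genuine OOM.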

Do you think there is a significant relationship between batch size and model performance in the context of NMT?

The batch size has an impact on model performance, especially for Transformer models. However, larger batch sizes can always be emulated with gradient accumulation.
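
As a toy illustration of that idea (a sketch on a one-parameter linear model, not the OpenNMT-tf implementation), averaging the gradients of several equally sized small batches gives the same gradient as one large batch. All names and numbers below are made up for the example.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Made-up data: 8 examples, to be processed either at once or 2 at a time.
xs = [float(i) for i in range(1, 9)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

# Gradient computed on the full batch of 8 examples.
full_gradient = grad_mse(w, xs, ys)

# Gradient accumulated over 4 micro-batches of 2 examples each.
micro_batch_size = 2
accumulated = 0.0
num_steps = 0
for start in range(0, len(xs), micro_batch_size):
    end = start + micro_batch_size
    accumulated += grad_mse(w, xs[start:end], ys[start:end])
    num_steps += 1
accumulated_gradient = accumulated / num_steps

print(full_gradient, accumulated_gradient)
```

The match is exact here because every micro-batch has the same number of examples. With token-based batches of varying size, the result depends on how the loss is normalized, so the equivalence should be read as approximate.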

See the training parameter effective_batch_size (set by default with --auto_config).
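
A rough sketch of the arithmetic behind this parameter: the number of accumulation steps is the effective batch size divided by what all GPUs process in one step, rounded up. The function `accumulation_steps` is hypothetical, the rounding rule is my assumption about how the training computes it, and 25000 is only an assumed example for the effective batch size.

```python
import math


def accumulation_steps(effective_batch_size, batch_size, num_replicas=1):
    """Number of batches to accumulate before applying one update (assumed formula)."""
    return math.ceil(effective_batch_size / (batch_size * num_replicas))


# Assumed example value for the effective batch size, in tokens.
effective = 25000

print(accumulation_steps(effective, batch_size=3072))                  # one GPU, 3072 tokens
print(accumulation_steps(effective, batch_size=1500))                  # one GPU, reduced batch
print(accumulation_steps(effective, batch_size=3072, num_replicas=4))  # four GPUs
```

So reducing the per-GPU batch size to fit in memory mostly costs more accumulation steps per update, rather than changing the batch size the optimizer effectively sees.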

Oh I see, that is very interesting.

Just to let you know, I get this warning when trying to use the asynchronous allocator:

“TF_GPU_ALLOCATOR=cuda_malloc_async environment found, but TensorFlow was not compiled with CUDA 11.2+.”
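
One way to check this ahead of time is to look at the CUDA version the installed TensorFlow was built against. This is a sketch under assumptions: `cuda_supports_async_malloc` and `report_tensorflow_cuda` are hypothetical helpers, the 11.2 threshold is taken from the warning above, and the `cuda_version` key of `tf.sysconfig.get_build_info()` may be missing or formatted differently on some builds.

```python
import importlib.util


def cuda_supports_async_malloc(cuda_version):
    """Return True if a version string such as "11.2" is at least CUDA 11.2."""
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor) >= (11, 2)


def report_tensorflow_cuda():
    """Print whether the installed TensorFlow build looks compatible (hypothetical helper)."""
    if importlib.util.find_spec("tensorflow") is None:
        print("TensorFlow is not installed")
        return
    import tensorflow as tf

    cuda_version = tf.sysconfig.get_build_info().get("cuda_version")
    if cuda_version is None:
        print("This TensorFlow build does not report a CUDA version")
        return
    try:
        supported = cuda_supports_async_malloc(cuda_version)
    except ValueError:
        print("Unrecognized CUDA version format:", cuda_version)
        return
    print("CUDA", cuda_version, "cuda_malloc_async usable:", supported)


if __name__ == "__main__":
    report_tensorflow_cuda()
```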