How to configure TensorFlow config and optimize memory usage

JptoEn · December 21, 2021, 2:50pm

Hi, I’m trying to improve my project by training on a larger vocabulary (100k) with the TransformerBigRelative model, but I get an OOM error even on a 16GB P100.

The error message suggests setting TF_GPU_ALLOCATOR=cuda_malloc_async as an environment variable, so I tried that in Colab:

import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'
print(os.getenv('TF_GPU_ALLOCATOR'))

I also tried:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

But it still seems to throw the same error message.

I see the batch size is configured as 3072 tokens. I guess that assumes multi-GPU training? I will try reducing it, but maybe it is better to use, for example, 32 samples instead, so that a batch doesn’t cut off in the middle of a sentence?

guillaumekln · December 21, 2021, 3:04pm

Hi,

You should probably reduce the batch size.

You can try the automatic batch size selection by setting batch_size: 0 in the configuration. The training will try to find a value for you.
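For illustration, here is a minimal sketch of that setting written as a Python dictionary that mirrors the YAML configuration. It assumes the batch options live under the train section and that batch_type selects the unit; check both against the configuration reference of your OpenNMT-tf version.

```python
# Sketch of the relevant training options as a Python dict mirroring the YAML file.
# Assumption: `batch_size` and `batch_type` sit under the `train` section.
config = {
    "train": {
        "batch_type": "tokens",  # batch size is counted in tokens, not sentences
        "batch_size": 0,         # 0 asks the training to search for a value that fits
    }
}

print(config["train"])
```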

This is the batch size for a single GPU. In multi-GPU training, each GPU gets a batch size of 3072 tokens with this configuration.

Sentences are never cut off when preparing batches. Token-based batch size is usually better for the hardware since the total size of a batch (num. samples * length) is constant.
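To make this concrete, here is a simplified sketch of token-based batching. It is not the actual OpenNMT-tf implementation (the function name batch_by_tokens and the greedy strategy are made up for illustration); it only shows that whole sentences are grouped until the padded size (num. samples * longest length) would exceed the budget.

```python
def batch_by_tokens(sentences, max_tokens):
    """Greedily group whole sentences so that
    (number of sentences) * (longest sentence) stays within max_tokens.

    Hypothetical helper for illustration only. A sentence longer than
    max_tokens still gets a batch of its own; it is never split.
    """
    batches, current, longest = [], [], 0
    for sentence in sentences:
        new_longest = max(longest, len(sentence))
        padded_size = (len(current) + 1) * new_longest
        if current and padded_size > max_tokens:
            # Adding this sentence would exceed the budget: close the batch.
            batches.append(current)
            current, new_longest = [], len(sentence)
        current.append(sentence)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


# Toy corpus: tokenized sentences of length 2, 3, 3, 5 and 8.
corpus = [["w"] * n for n in (2, 3, 3, 5, 8)]
batches = batch_by_tokens(corpus, max_tokens=12)
print([len(batch) for batch in batches])  # [3, 1, 1]
```

Short sentences end up in batches with many samples and long sentences in batches with few samples, so the memory footprint per batch stays roughly the same.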

JptoEn · December 21, 2021, 3:11pm

Thanks for the reply. The automatic batch size seems to be around 1500.

I just noticed that TF overrides the memory growth setting unless you force it with an environment variable: “Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0”
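For anyone who wants to try the same thing, here is a minimal sketch of forcing memory growth through the environment. It assumes the variable has to be in place before TensorFlow initializes the GPU, so it is set before the first import; in a Colab runtime where TensorFlow was already imported, a restart may be needed for it to take effect.

```python
import os

# Set the variable before TensorFlow is imported, so that it is already
# present when the GPU is initialized.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

try:
    import tensorflow as tf
except ImportError:
    tf = None  # TensorFlow is not installed in this environment

if tf is not None:
    # Listing the devices is enough to check that the GPU is visible.
    print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```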

Do you think there is a significant relationship between batch size and model performance in the context of NMT?

guillaumekln · December 21, 2021, 3:17pm

The batch size has an impact on model performance, especially for Transformer models. However, larger batch sizes can always be emulated with gradient accumulation.

See the training parameter effective_batch_size (set by default with --auto_config).
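As a rough sketch of the arithmetic, the number of batches to accumulate before each parameter update can be estimated as below. The formula and the value 25000 for effective_batch_size are assumptions for illustration, so check the actual value in your configuration; accumulation_steps is a hypothetical helper name.

```python
import math


def accumulation_steps(batch_size, effective_batch_size, num_gpus=1):
    """Number of batches to accumulate so that one update sees at least
    effective_batch_size tokens (assumed formula, for illustration)."""
    return math.ceil(effective_batch_size / (batch_size * num_gpus))


# Assumed effective batch size of 25000 tokens on a single GPU.
print(accumulation_steps(1500, 25000))  # 17
print(accumulation_steps(3072, 25000))  # 9
```

With a smaller per-step batch size, training accumulates gradients over more steps, so each update still covers about the same number of tokens, at the cost of more time per update.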

JptoEn · December 21, 2021, 3:22pm

Oh I see, that is very interesting.

JptoEn · December 21, 2021, 4:16pm

Just to notify you that I get this warning when trying to use asynchronous malloc:

“TF_GPU_ALLOCATOR=cuda_malloc_async environment found, but TensorFlow was not compiled with CUDA 11.2+.”