CUDA out-of-memory issue with Horovod

Hello,

I’m working on @dmarin’s team, and following what was discussed in this topic, we are currently working on running the training with Horovod.

In summary, the linked topic was about a performance issue when using multiple GPUs, in our case 4 Quadro RTX 6000, and it was suggested to run the training with Horovod.

However, when running with Horovod I get a CUDA out-of-memory error, and I can’t track down its origin. We are using the OpenNMT-tf API and run it from our own code. It is run as follows:

def run(self):
    """"""
    self._logger.info(f"Training model at dir {self._training_paths.base_dir}")
    self._create_requirements_file()
    self._initialize_summary()
    runner = Runner(
        self._model, self._get_opennmt_config(), auto_config=True, mixed_precision=self._mixed_precision, seed=42
    )
    self._logger.info(f"Training with {self._num_devices} devices.")

    gpus = tf.config.experimental.list_physical_devices("GPU")
    hvd.init()
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
    for device in gpus:
        tf.config.experimental.set_memory_growth(device, enable=True)
    final_model_dir, train_summary = runner.train(
        num_devices=1, with_eval=True, return_summary=True, hvd=hvd
    )
    self._logger.info(f"Final model saved under {final_model_dir}.")
    self._logger.info(f"Train summary: {final_model_dir, train_summary}.")
    self._update_summary(train_summary)

It is not stated in the documentation, but a comment in the runner.train code says that num_devices should be equal to 1 when running with Horovod.

Finally, I’m launching Horovod with this command:

horovodrun --log-level DEBUG -np 4 -H localhost:4 python notebooks/pipeline/train.py  > out.log 2>error.log

Here are the logs:

error.log - Pastebin.com

I can’t post the output log file… I’m limited to two links per post, and I can’t paste it here because it has too many characters…

If I need to add more details, just tell me :slight_smile:

Thank you.

Hello,

It seems the processes do not correctly set their visible GPU. According to the logs, each process can see all available GPUs. For example here’s the log from process #3:

[3]<stderr>:2021-11-24 09:32:41.501227: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[3]<stderr>:2021-11-24 09:32:41.501238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 1 2 3 
[3]<stderr>:2021-11-24 09:32:41.501244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N Y Y Y 
[3]<stderr>:2021-11-24 09:32:41.501248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1:   Y N Y Y 
[3]<stderr>:2021-11-24 09:32:41.501252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 2:   Y Y N Y 
[3]<stderr>:2021-11-24 09:32:41.501255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3:   Y Y Y N 

Can you check the following line is correctly executed?

tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

You may need to run it earlier in the code, before building the model and the runner.
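For example, something along these lines at the very top of the training script (just a rough sketch, not tested against your code; "model" and "config" are placeholders):

# Rough sketch: pin the GPU of this process before any TensorFlow object
# (model, Runner, dataset) is created.
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Only build the model and the runner once the visibility is set:
# runner = Runner(model, config, auto_config=True)
# runner.train(num_devices=1, with_eval=True, hvd=hvd)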

Here’s the expected behavior on a system with 2 GPUs:

  • without restricted GPU visibility (2 GPUs appear in device matrix):
>>> import tensorflow as tf
>>> tf.zeros([2])
...
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   N N
....
  • with restricted GPU visibility (only 1 GPU appears in the matrix):
>>> import tensorflow as tf
>>> devices = tf.config.list_physical_devices("GPU")
>>> tf.config.set_visible_devices(devices[0], "GPU")
>>> tf.zeros([2])
...
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
....

Thank you for the reply. That’s what I was thinking, and you are right: after moving the set_visible_devices call earlier in the code, the behavior changes. But I now get a new error that I believe is still related to this:

[3]<stdout>:tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (GPU:0) is being mapped to multiple devices (3 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see https://github.com/tensorflow/tensorflow/issues/19083

The full logs:
out.log - Pastebin.com
error.log - Pastebin.com
But at least I don’t get the OOM error anymore.

Possibly related, the following line is incorrect:

for device in gpus:
    tf.config.experimental.set_memory_growth(device, enable=True)

It should be:

tf.config.experimental.set_memory_growth(gpus[hvd.local_rank()], enable=True)

Can you try with this change?

I have changed my code to this:

gpus = tf.config.experimental.list_physical_devices("GPU")
hvd.init()
print(f'Horovod local rank : {hvd.local_rank()}')
if gpus:
    local_gpu = gpus[hvd.local_rank()]
    print(f'Setting gpu visibility {local_gpu}')
    tf.config.experimental.set_visible_devices(local_gpu, 'GPU')
    tf.config.experimental.set_memory_growth(local_gpu, enable=True)

But I still have the same error:

[3]<stderr>:2021-11-24 12:53:05.713083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
[3]<stderr>:pciBusID: 0000:dc:00.0 name: Quadro RTX 6000 computeCapability: 7.5
[3]<stderr>:coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 23.65GiB deviceMemoryBandwidth: 625.94GiB/s
[3]<stderr>:2021-11-24 12:53:05.713965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 3
[3]<stderr>:2021-11-24 12:53:05.713994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
[3]<stderr>:2021-11-24 12:53:05.713999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      3 
[3]<stderr>:2021-11-24 12:53:05.714004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 3:   N 
[3]<stderr>:Traceback (most recent call last):
[3]<stderr>:  File "notebooks/pipeline/train.py", line 34, in <module>
[3]<stderr>:    Trainer(**config['train']['trainer']).run(hvd = hvd)
[3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/cdtnice/common/pipeline.py", line 34, in wrapped_f
[3]<stderr>:    raise e
[3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/cdtnice/common/pipeline.py", line 30, in wrapped_f
[3]<stderr>:    function_return_value = f(*args, **kwargs)
[3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/cdtnice/train/train.py", line 51, in run
[3]<stderr>:    final_model_dir, train_summary = runner.train(
[3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/runner.py", line 199, in train
[3]<stderr>:    devices = misc.get_devices(count=num_devices, fallback_to_cpu=fallback_to_cpu)
[3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/opennmt/utils/misc.py", line 33, in get_devices
[3]<stderr>:    devices = tf.config.list_logical_devices(device_type=device_type)
[3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/framework/config.py", line 452, in list_logical_devices
[3]<stderr>:    return context.context().list_logical_devices(device_type=device_type)
[3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 1395, in list_logical_devices
[3]<stderr>:    self.ensure_initialized()
[3]<stderr>:  File "/opt/mt/miniconda3/envs/horovod/lib/python3.8/site-packages/tensorflow/python/eager/context.py", line 525, in ensure_initialized
[3]<stderr>:    context_handle = pywrap_tfe.TFE_NewContext(opts)
[3]<stderr>:tensorflow.python.framework.errors_impl.AlreadyExistsError: TensorFlow device (GPU:0) is being mapped to multiple devices (3 now, and 0 previously), which is not supported. This may be the result of providing different GPU configurations (ConfigProto.gpu_options, for example different visible_device_list) when creating multiple Sessions in the same process. This is not currently supported, see https://github.com/tensorflow/tensorflow/issues/19083

I can’t find a way to reproduce this error. I tried copying your last code snippet and then calling tf.config.list_logical_devices, which seems to be the sequence of events leading to the error.
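Concretely, here is roughly what I ran (launched with horovodrun, one process per GPU), and it did not raise the AlreadyExistsError on my side:

# Roughly the sequence I tested; "repro.py" is just a placeholder name for this snippet.
# Launched with: horovodrun -np 4 -H localhost:4 python repro.py
import horovod.tensorflow as hvd
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices("GPU")
hvd.init()
if gpus:
    local_gpu = gpus[hvd.local_rank()]
    tf.config.experimental.set_visible_devices(local_gpu, "GPU")
    tf.config.experimental.set_memory_growth(local_gpu, enable=True)

# The call that fails on your side according to the traceback:
print(tf.config.list_logical_devices("GPU"))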

Can you try to isolate and share a complete code snippet to reproduce this error?

FYI, I just tried to start a Horovod training with onmt-main and it is running fine.

I will try to isolate what causes the error and update you once I have managed to do so.

Hello, sorry for the delay, but our training machine was down. While trying to create a minimal reproducible example, it actually worked, which allowed me to find out what was not working.

It was due to the following method:

from tensorflow.python.client import device_lib
...
def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == "GPU"]

Used as

len(get_available_gpus())

That call would cause the process to crash, and removing it makes the training work.
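In case it helps others, here is the replacement I have in mind (my assumption is that tf.config.list_physical_devices does not initialize the TensorFlow runtime, unlike device_lib.list_local_devices, but I have not verified this in depth):

# Assumed safer alternative for counting GPUs before the runtime is configured.
import tensorflow as tf

def get_available_gpus():
    return [gpu.name for gpu in tf.config.list_physical_devices("GPU")]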

So, thank you for your valuable help.
