Horovod + onmt-main

Hello,

I’ve been trying to use Horovod, but so far with no success. I’m running this command:

horovodrun -np 2 -H localhost:2 onmt-main --model_type Transformer --config ./PREP/eng-dut/test/OpenNMT/model_config.yaml --horovod --auto_config train --with_eval

and I’m getting this output:

usage: horovodrun [-h] [-v] -np NUM_PROC [-cb] [--disable-cache]

I’ve followed the example provided here:

https://opennmt.net/OpenNMT-tf/training.html?highlight=horovod#distributed-training-with-horovod

Any hint on what I could be doing wrong?

Looks good to me, but maybe try moving the --horovod option to the end of your training command, after --with_eval?
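For example, keeping the rest of your command unchanged:

horovodrun -np 2 -H localhost:2 onmt-main --model_type Transformer --config ./PREP/eng-dut/test/OpenNMT/model_config.yaml --auto_config train --with_eval --horovod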

I just tried that and I got the same result. I’m currently running the process in a subprocess (Popen). I will try to run it directly in the Jupyter command line and see if I get a different result.
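For reference, a minimal sketch of how I launch it from Python (my actual notebook code may differ; I will double-check the argument splitting, since handing horovodrun a mangled argument list would explain the usage message):

import shlex
import subprocess

# Full training command as a single string.
cmd = (
    "horovodrun -np 2 -H localhost:2 onmt-main "
    "--model_type Transformer "
    "--config ./PREP/eng-dut/test/OpenNMT/model_config.yaml "
    "--horovod --auto_config train --with_eval"
)

# Split the string into an argument list so horovodrun receives
# -np, -H, etc. as separate arguments instead of one big token.
proc = subprocess.Popen(
    shlex.split(cmd),
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
for line in proc.stdout:
    print(line.decode(), end="")
proc.wait()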

Thank you for your input!

So if I run the command directly in the Jupyter notebook it launches, but it then fails with this error:

Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR

[1]<stderr>:2022-11-15 16:04:28.990539: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:28.990788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
[1]<stderr>:pciBusID: 0000:04:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
[1]<stderr>:coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
[1]<stderr>:2022-11-15 16:04:28.990851: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:28.991095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
[1]<stderr>:pciBusID: 0000:0a:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
[1]<stderr>:coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s

and

[1]<stderr>:2022-11-15 16:04:28.995408: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:28.995600: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:28.995726: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1]<stderr>:2022-11-15 16:04:28.995778: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:28.995855: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1]<stderr>:2022-11-15 16:04:28.995956: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:28.996092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[0]<stderr>:2022-11-15 16:04:28.997044: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[0]<stderr>:2022-11-15 16:04:28.997317: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[0]<stderr>:2022-11-15 16:04:28.997394: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[0]<stderr>:2022-11-15 16:04:28.997447: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:28.997614: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:28.997770: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:28.997928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:28.998050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
[1]<stderr>:2022-11-15 16:04:29.010274: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[0]<stderr>:2022-11-15 16:04:29.011479: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[0]<stderr>:2022-11-15 16:04:29.241000: I main.py:314] Using OpenNMT-tf version 2.21.0

and

[1]<stderr>:2022-11-15 16:04:29.245470: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[1]<stderr>:To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[1]<stderr>:2022-11-15 16:04:29.245821: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[1]<stderr>:2022-11-15 16:04:29.245948: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:29.246824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
[1]<stderr>:pciBusID: 0000:0a:00.0 name: NVIDIA GeForce RTX 3090 computeCapability: 8.6
[1]<stderr>:coreClock: 1.725GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
[1]<stderr>:2022-11-15 16:04:29.246840: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1]<stderr>:2022-11-15 16:04:29.246862: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
[1]<stderr>:2022-11-15 16:04:29.246871: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
[1]<stderr>:2022-11-15 16:04:29.246878: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
[1]<stderr>:2022-11-15 16:04:29.246884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
[1]<stderr>:2022-11-15 16:04:29.246891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
[1]<stderr>:2022-11-15 16:04:29.246897: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
[1]<stderr>:2022-11-15 16:04:29.246904: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
[1]<stderr>:2022-11-15 16:04:29.246965: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:29.247180: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:29.247450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 1
[1]<stderr>:2022-11-15 16:04:29.707475: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[1]<stderr>:2022-11-15 16:04:29.707512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      1 
[1]<stderr>:2022-11-15 16:04:29.707520: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 1:   N 
[1]<stderr>:2022-11-15 16:04:29.707692: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:29.707854: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:29.707989: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[1]<stderr>:2022-11-15 16:04:29.708104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22378 MB memory) -> physical GPU (device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:0a:00.0, compute capability: 8.6)
[1]<stderr>:2022-11-15 16:04:29.708000: I main.py:323] Searching the largest batch size between 256 and 16384 with a precision of 256...
[1]<stderr>:2022-11-15 16:04:29.712000: I main.py:323] Trying training with batch size 8320...
[0]<stderr>:2022-11-15 16:04:29.730524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
[0]<stderr>:2022-11-15 16:04:29.730547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
[0]<stderr>:2022-11-15 16:04:29.730556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
[0]<stderr>:2022-11-15 16:04:29.730739: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:29.730914: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:29.731065: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[0]<stderr>:2022-11-15 16:04:29.731194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22378 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:04:00.0, compute capability: 8.6)
[0]<stderr>:2022-11-15 16:04:29.731000: I main.py:323] Searching the largest batch size between 256 and 16384 with a precision of 256...
[0]<stderr>:2022-11-15 16:04:29.735000: I main.py:323] Trying training with batch size 8320...
[1]<stderr>:2022-11-15 16:04:50.086000: I main.py:323] ... failed.
[1]<stderr>:2022-11-15 16:04:50.092000: I main.py:323] Trying training with batch size 4287...
[0]<stderr>:2022-11-15 16:04:50.232000: I main.py:323] ... failed.
[0]<stderr>:2022-11-15 16:04:50.239000: I main.py:323] Trying training with batch size 4287...
[0]<stderr>:2022-11-15 16:05:10.519000: I main.py:323] ... failed.
[0]<stderr>:2022-11-15 16:05:10.524000: I main.py:323] Trying training with batch size 2271...
[1]<stderr>:2022-11-15 16:05:10.540000: I main.py:323] ... failed.
[1]<stderr>:2022-11-15 16:05:10.547000: I main.py:323] Trying training with batch size 2271...
[1]<stderr>:2022-11-15 16:05:30.812000: I main.py:323] ... failed.
[1]<stderr>:2022-11-15 16:05:30.817000: I main.py:323] Trying training with batch size 1263...
[0]<stderr>:2022-11-15 16:05:31.233000: I main.py:323] ... failed.
[0]<stderr>:2022-11-15 16:05:31.238000: I main.py:323] Trying training with batch size 1263...
[0]<stderr>:2022-11-15 16:05:51.488000: I main.py:323] ... failed.
[0]<stderr>:2022-11-15 16:05:51.493000: I main.py:323] Trying training with batch size 759...
[1]<stderr>:2022-11-15 16:05:51.757000: I main.py:323] ... failed.
[1]<stderr>:2022-11-15 16:05:51.762000: I main.py:323] Trying training with batch size 759...
[0]<stderr>:2022-11-15 16:06:12.087000: I main.py:323] ... failed.
[0]<stderr>:2022-11-15 16:06:12.092000: I main.py:323] Trying training with batch size 507...
[1]<stderr>:2022-11-15 16:06:12.462000: I main.py:323] ... failed.
[1]<stderr>:2022-11-15 16:06:12.468000: I main.py:323] Trying training with batch size 507...
[0]<stderr>:2022-11-15 16:06:32.391000: I main.py:323] ... failed.
[0]<stderr>:2022-11-15 16:06:32.392000: E main.py:323] Last training attempt exited with an error:
[0]<stderr>:
[0]<stderr>:"""
[0]<stderr>:2022-11-15 16:06:31.120468: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED

Hi,

I’m not sure the batch size automatic search is working with Horovod. Can you try setting a fixed batch size in the configuration?
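For example, something along these lines in your model_config.yaml (the values are only illustrative, adjust them to what fits on your GPUs):

train:
  batch_size: 4096
  batch_type: tokens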

Hello,

  1. I confirm that batch size = 0 doesn’t work with Horovod.

  2. I also had an issue with my cuDNN version, which was not compatible with the TensorFlow and CUDA versions installed.

  3. I’m still not able to use the --mixed_precision flag; I’m getting this error:

onmt-main: error: unrecognized arguments: --mixed_precision

My command line:

horovodrun -np 2 -H localhost:2 onmt-main --model_type Transformer --config ./PREP/eng-dut/test/OpenNMT/model_config.yaml --auto_config train --with_eval --horovod --mixed_precision

Thank you for your help! It’s much appreciated.

--mixed_precision must be set before train.
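For example, applied to your command:

horovodrun -np 2 -H localhost:2 onmt-main --model_type Transformer --config ./PREP/eng-dut/test/OpenNMT/model_config.yaml --auto_config --mixed_precision train --with_eval --horovod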

You can check the --help output to see which options are general and which are training-specific:

onmt-main --help
onmt-main train --help