Hello,
I’m working in @dmarin’s team, and following what was discussed in this topic, we are currently working on running the training with Horovod.
In summary, the linked topic was about a performance issue when using multiple GPUs, in our case 4 Quadro RTX 6000, and it was suggested to run the training with Horovod.
However, when running with Horovod I get a CUDA out of memory error, and I can’t track down its origin. We are using the OpenNMT-tf API and calling it from our own code. The way it is run is the following:
def run(self):
    """"""
    self._logger.info(f"Training model at dir {self._training_paths.base_dir}")
    self._create_requirements_file()
    self._initialize_summary()
    runner = Runner(
        self._model, self._get_opennmt_config(), auto_config=True, mixed_precision=self._mixed_precision, seed=42
    )
    self._logger.info(f"Training with {self._num_devices} devices.")
    gpus = tf.config.experimental.list_physical_devices("GPU")
    hvd.init()
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
        for device in gpus:
            tf.config.experimental.set_memory_growth(device, enable=True)
    final_model_dir, train_summary = runner.train(
        num_devices=1, with_eval=True, return_summary=True, hvd=hvd
    )
    self._logger.info(f"Final model saved under {final_model_dir}.")
    self._logger.info(f"Train summary: {final_model_dir, train_summary}.")
    self._update_summary(train_summary)
It is not stated in the documentation, but a comment in the runner.train code says that num_devices should be equal to 1 when running with Horovod.
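For comparison, here is a minimal, self-contained sketch of how I understand the intended usage, combining the per-process GPU pinning from the Horovod TensorFlow docs with the OpenNMT-tf Runner. The model and config values below are hypothetical placeholders, not our real setup:

import horovod.tensorflow as hvd
import opennmt
import tensorflow as tf

# Initialize Horovod before touching the GPUs, then pin exactly one GPU per process.
hvd.init()
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    local_gpu = gpus[hvd.local_rank()]
    tf.config.experimental.set_memory_growth(local_gpu, enable=True)
    tf.config.experimental.set_visible_devices(local_gpu, "GPU")

# Hypothetical model and config, only to make the sketch self-contained.
config = {
    "model_dir": "run/",
    "data": {
        "source_vocabulary": "src-vocab.txt",
        "target_vocabulary": "tgt-vocab.txt",
        "train_features_file": "train.src",
        "train_labels_file": "train.tgt",
    },
}
runner = opennmt.Runner(opennmt.models.TransformerBase(), config, auto_config=True)

# num_devices stays at 1: each Horovod process drives a single GPU.
runner.train(num_devices=1, with_eval=False, hvd=hvd)

The main difference from our run() is that this pins the GPU before creating the Runner and only sets memory growth on the local rank’s GPU; I’m not sure whether that ordering matters.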
Finally, I’m launching Horovod with this command:
horovodrun --log-level DEBUG -np 4 -H localhost:4 python notebooks/pipeline/train.py > out.log 2>error.log
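If it helps with diagnosing, I can also add something like the following right after the device setup, to log which GPU each rank actually sees (just a rough sketch; the imports are already present in our module):

import horovod.tensorflow as hvd
import tensorflow as tf

# Log the rank-to-GPU mapping so the error log shows which process runs out of memory.
visible = tf.config.get_visible_devices("GPU")
print(
    f"rank={hvd.rank()} local_rank={hvd.local_rank()} "
    f"visible_gpus={[d.name for d in visible]}"
)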
Here are the logs:
I can’t post the output log file… I’m limited to two links per post, and I can’t paste it here because that makes too many characters…
If I need to add more details, just tell me.
Thank you.