Multi-node distributed training

szhang42 · August 3, 2021, 1:45am

Hello,

I am working on multi-node training for OpenNMT. I have two nodes with 4 GPUs each. To train OpenNMT across both nodes, do I just set world_size to 8 and gpu_ranks to [0, 1, 2, 3, 4, 5, 6, 7]? Is there anything else I need to change for training on two nodes with 4 GPUs each? In addition, I am using Open MPI as the communicator. Thanks!

francoishernandez · August 5, 2021, 1:17pm

Hi there,
I don't think we ever properly tested this distributed setup in versions >= 2.

For v1.2 you can check this entry in the docs.

If you are interested in contributing to enable this feature in v2, you're welcome to!
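
For the rank arithmetic in the question, here is a minimal sketch of the common multi-node convention: world_size counts every GPU across all nodes, and each node is launched with only its own slice of the global ranks. The helper name ranks_for_node is hypothetical (it is not an OpenNMT function), and whether v2 accepts this layout is not confirmed in this thread.

```python
# Hypothetical helper, not part of OpenNMT: computes the global GPU ranks
# that one node would own under the usual "contiguous block per node" convention.
def ranks_for_node(node_index: int, gpus_per_node: int) -> list[int]:
    start = node_index * gpus_per_node
    return list(range(start, start + gpus_per_node))


# Setup from the question: two nodes with 4 GPUs each.
num_nodes = 2
gpus_per_node = 4

# world_size is the total number of GPU processes across all nodes.
world_size = num_nodes * gpus_per_node

for node in range(num_nodes):
    ranks = ranks_for_node(node, gpus_per_node)
    print(f"node {node}: world_size={world_size}, gpu_ranks={ranks}")
```

Under this convention the first node would get ranks 0 to 3 and the second node ranks 4 to 7, rather than every node listing all eight ranks. The nodes would also need a shared rendezvous address and port, which this sketch leaves out.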