I tried to use multi-node distributed training, but it’s not working. I have 2 nodes, each with 8 GPUs, and I set the world_size to 16.
The training scripts:
Multi-node distributed training is not fully supported yet in OpenNMT-py v2.X.
You can test this with the legacy 1.2.0 version though: you just need to run the train script directly on each node with the corresponding args. https://opennmt.net/OpenNMT-py/legacy/FAQ.html#do-you-support-multi-gpu
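For reference, a sketch of what the per-node commands could look like for a 2×8 setup, based on the flags documented in that FAQ (`-world_size`, `-gpu_ranks`, `-master_ip`, `-master_port`); the data path, model path, IP and port below are placeholders to adjust:

```bash
# Node 0 (hosts the master process) -- handles global ranks 0-7
python train.py -data data/demo -save_model demo-model \
    -world_size 16 -gpu_ranks 0 1 2 3 4 5 6 7 \
    -master_ip <ip_of_node_0> -master_port 10000

# Node 1 -- handles global ranks 8-15
python train.py -data data/demo -save_model demo-model \
    -world_size 16 -gpu_ranks 8 9 10 11 12 13 14 15 \
    -master_ip <ip_of_node_0> -master_port 10000
```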
Thanks for your reply! May I ask why distributed training can't be used with v2.x?
Since the data-processing pipeline in v2.x is a big plus for me, I'd rather use it than go back to v1.x.
It's just that it was never properly re-implemented. It should not be that difficult to add if you want to contribute: look for the stride/offset mechanism in the code. It should be adapted to take the actual gpu_rank into account instead of the local device index.
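To make the idea concrete, here is a minimal sketch of such a stride/offset split (illustrative only, not the actual OpenNMT-py code; the function name and arguments are made up):

```python
from itertools import islice

def shard_examples(example_stream, world_size, gpu_rank):
    """Yield only the examples this process should train on.

    Stride/offset split: with stride = world_size (total processes
    across all nodes) and offset = gpu_rank (the global rank),
    example i goes to the process where i % world_size == gpu_rank.
    Using the local device index as the offset instead of the global
    rank is what breaks multi-node training: every node would then
    read the same shards.
    """
    return islice(example_stream, gpu_rank, None, world_size)

# Hypothetical usage for 2 nodes x 8 GPUs (world_size = 16):
# local device 3 on the second node has global gpu_rank 11.
examples = range(32)  # stand-in for a corpus iterator
print(list(shard_examples(examples, world_size=16, gpu_rank=11)))
# -> [11, 27]
```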
It's probably more a matter of testing and validating than of writing code at this point.