I tried to use multi-node distributed training, but it’s not working. I have 2 nodes, each with 8 GPUs, and I set the world_size to 16.
The training scripts:
Multi-node distributed training is not fully supported yet in OpenNMT-py v2.X.
You can test this with the legacy 1.2.0 version though: you just need to run the train script directly on each node with the corresponding args. https://opennmt.net/OpenNMT-py/legacy/FAQ.html#do-you-support-multi-gpu
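For reference, a sketch of what the per-node commands could look like for a 2×8 setup, based on the flags documented in that FAQ (`-world_size`, `-gpu_ranks`, `-master_ip`, `-master_port`); the data path, model path, IP and port below are placeholders to adjust:

```bash
# Node 0 (hosts the master process) -- handles global ranks 0-7
python train.py -data data/demo -save_model demo-model \
    -world_size 16 -gpu_ranks 0 1 2 3 4 5 6 7 \
    -master_ip <ip_of_node_0> -master_port 10000

# Node 1 -- handles global ranks 8-15
python train.py -data data/demo -save_model demo-model \
    -world_size 16 -gpu_ranks 8 9 10 11 12 13 14 15 \
    -master_ip <ip_of_node_0> -master_port 10000
```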
Thanks for your reply! May I ask why distributed training can't be used with v2.x?
Since the data-processing pipeline in v2.x is a big plus for me, I'd rather use it than go back to v1.x.
It's just that it was never properly re-implemented. It should not be that difficult to add if you want to contribute: look for the stride/offset mechanism in the code. It should be adapted to take the actual gpu_rank into account instead of the local device index.
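To make the idea concrete, here is a minimal sketch of such a stride/offset split (illustrative only, not the actual OpenNMT-py code; the function name and arguments are made up):

```python
from itertools import islice

def shard_examples(example_stream, world_size, gpu_rank):
    """Yield only the examples this process should train on.

    Stride/offset split: with stride = world_size (total processes
    across all nodes) and offset = gpu_rank (the global rank),
    example i goes to the process where i % world_size == gpu_rank.
    Using the local device index as the offset instead of the global
    rank is what breaks multi-node training: every node would then
    read the same shards.
    """
    return islice(example_stream, gpu_rank, None, world_size)

# Hypothetical usage for 2 nodes x 8 GPUs (world_size = 16):
# local device 3 on the second node has global gpu_rank 11.
examples = range(32)  # stand-in for a corpus iterator
print(list(shard_examples(examples, world_size=16, gpu_rank=11)))
# -> [11, 27]
```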
It's probably more a matter of testing and validating than of writing code at this point.