Distributed training on multiple servers

tensorflow

(Lockder) #1

I’m trying to run distributed training on multiple servers, but I must be doing something wrong; either the documentation is not well explained or I don’t fully understand it.
I tried to run the following commands.

on host 0
CUDA_VISIBLE_DEVICES=0 onmt-main train […]
--ps_hosts host0:2222
--chief_host host0:2223
--worker_hosts host1:2224
--task_type ps
--task_index 0

then in another terminal
CUDA_VISIBLE_DEVICES=0 onmt-main train […]
--ps_hosts host0:2222
--chief_host host0:2223
--worker_hosts host1:2224
--task_type chief
--task_index 0

on host 1

CUDA_VISIBLE_DEVICES=0 onmt-main train […]
--ps_hosts host0:2222
--chief_host host0:2223
--worker_hosts host1:2224
--task_type worker
--task_index 0
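For reference, the three commands above together describe a single TensorFlow cluster. A minimal sketch of the cluster spec those flags encode, assuming TensorFlow's standard TF_CONFIG convention (the helper function name here is hypothetical, not part of onmt-main):

```python
import json

# The ps/chief/worker flags in the commands above correspond to this
# cluster layout (same hosts and ports as used in the commands).
cluster = {
    "ps": ["host0:2222"],
    "chief": ["host0:2223"],
    "worker": ["host1:2224"],
}

def tf_config_for(task_type, task_index):
    """Build the TF_CONFIG JSON string for one process in the cluster.

    Each of the three processes gets the same cluster description but a
    different task type/index, which is exactly what the --task_type and
    --task_index flags select.
    """
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })

# One TF_CONFIG per process, matching the three commands above.
print(tf_config_for("ps", 0))
print(tf_config_for("chief", 0))
print(tf_config_for("worker", 0))
```

Note that every process must be able to reach every listed host:port pair, and each process must see the identical cluster spec; only the task section differs.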

I also tried running with only a chief and a worker, but the last line the system prints is always " INFO:tensorflow:Graph was finalized. "

Then the terminal remains waiting forever. What am I doing wrong?


(Guillaume Klein) #2

As you also opened an issue on GitHub, let’s continue the discussion there: