I’m trying to run distributed training on multiple servers, but I must be doing something wrong: either the documentation is not well explained or I don’t understand it fully.
I tried to run the following commands.
On host 0:
CUDA_VISIBLE_DEVICES=0 onmt-main train […] \
--ps_hosts host0:2222 \
--chief_host host0:2223 \
--worker_hosts host1:2224 \
--task_type ps \
--task_index 0
Then, in another terminal:
CUDA_VISIBLE_DEVICES=0 onmt-main train […] \
--ps_hosts host0:2222 \
--chief_host host0:2223 \
--worker_hosts host1:2224 \
--task_type chief \
--task_index 0
And on host 1:
CUDA_VISIBLE_DEVICES=0 onmt-main train […] \
--ps_hosts host0:2222 \
--chief_host host0:2223 \
--worker_hosts host1:2224 \
--task_type worker \
--task_index 0
I also tried running with only a chief and a worker, but the last line the system ever prints is "INFO:tensorflow:Graph was finalized."
Then the terminal just waits forever. What am I doing wrong?