Issue in using distributed training in openNMT-TF

tensorflow

(Parthasarathy Subburaj) #1

I am facing some issues when trying to use the distributed training framework in OpenNMT-tf. This may be a basic question, but I couldn't resolve it myself, nor could I find an answer on GitHub or Stack Overflow.

I am running my code on a Linux system that has 4 GPU cards. Initially I set CUDA_VISIBLE_DEVICES with

export CUDA_VISIBLE_DEVICES=0,1,2,3

Then I ran the following commands in my terminal (each one in a separate screen session):

Screen 1:
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:12240 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225 \
    --task_type ps \
    --task_index 0

Screen 2:
CUDA_VISIBLE_DEVICES=1 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:2224 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225 \
    --task_type chief \
    --task_index 0

Screen 3:
CUDA_VISIBLE_DEVICES=2,3 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:2224 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225,localhost:2222 \
    --task_type worker \
    --task_index 0

When I ran these commands, screens 2 and 3 (where my chief and worker are running) showed the following messages:

INFO:tensorflow:TF_CONFIG environment variable: {u'cluster': {u'ps': [u'localhost:2224'], u'chief': [u'localhost:2223'], u'worker': [u'localhost:2225']}, u'task': {u'index': 0, u'type': u'chief'}}
2018-07-24 15:12:34.621963: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-24 15:12:34.891773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:06:00.0
totalMemory: 7.92GiB freeMemory: 7.80GiB
2018-07-24 15:12:34.891837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
2018-07-24 15:12:34.930176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 3, '_task_type': u'chief', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0a74a65910>, '_save_checkpoints_steps': 50, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 1, '_tf_random_seed': None, '_master': u'grpc://localhost:2223', '_num_worker_replicas': 2, '_task_id': 0, '_log_step_count_steps': 50, '_model_dir': 'Spanish-English', '_save_summary_steps': 50}
INFO:tensorflow:Start Tensorflow server.
2018-07-24 15:12:34.933360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
E0724 15:12:34.933676489 15565 ev_epoll1_linux.c:1051] grpc epoll fd: 24
2018-07-24 15:12:34.938235: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 15:12:34.938269: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2224}
2018-07-24 15:12:34.938279: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2225}
2018-07-24 15:12:34.939204: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2223
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Number of trainable parameters: 72082255
2018-07-24 15:12:47.301593: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-07-24 15:12:47.301698: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-07-24 15:12:57.301841: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-07-24 15:12:57.301914: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/rep

Based on my understanding, the chief and the workers are not able to communicate with each other, but I don't know what is going wrong. Any help would be highly appreciated.


(Guillaume Klein) #2

What is your TensorFlow version?

There are some typos in your command lines that could cause issues:

You should set --ps_hosts localhost:2224 (the ps command in screen 1 currently uses localhost:12240).

It looks like the worker at localhost:2222 was never started?

Can you address those points and try again?


(Parthasarathy Subburaj) #3

Thanks for your response. I am currently using TensorFlow 1.4.1.

I changed --ps_hosts localhost:12240 to --ps_hosts localhost:2224 and ran my commands again, but it still shows the same error as before.

Also, how do I start the worker at localhost:2222? Won't executing the above command in screen 3 do it?


(Guillaume Klein) #4

Command 3 starts worker 0 (--task_index 0), which is localhost:2225.
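
To make that mapping concrete, here is a tiny illustration (just a sketch, not OpenNMT-tf code): --task_index simply selects one entry from the --worker_hosts list.

# Sketch: how --task_index maps to an entry of --worker_hosts.
worker_hosts = ["localhost:2225", "localhost:2222"]

print(worker_hosts[0])  # --task_index 0 -> localhost:2225
print(worker_hosts[1])  # --task_index 1 -> localhost:2222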


(Parthasarathy Subburaj) #5

Thank you.

But then how will I assign the task_index for localhost:2225?
When I tried running

CUDA_VISIBLE_DEVICES=2,3 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:2224 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225,localhost:2222 \
    --task_type worker \
    --task_index 0,1

I got the following error:

onmt-main: error: argument --task_index: invalid int value: '0,1'

I also tried using a single worker with the command

CUDA_VISIBLE_DEVICES=2 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:2224 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225 \
    --task_type worker \
    --task_index 0

Even this gave me the same error:
2018-07-24 15:12:57.301841: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0


(Guillaume Klein) #6

I checked that the following commands work on my test server:

CUDA_VISIBLE_DEVICES= python -m bin.main train_and_eval --config config/opennmt-defaults.yml config/data/toy-ende.yml --model_type NMTSmall --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2225 --task_index 0 --task_type worker

CUDA_VISIBLE_DEVICES=0 python -m bin.main train_and_eval --config config/opennmt-defaults.yml config/data/toy-ende.yml --model_type NMTSmall --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2225 --task_index 0 --task_type chief

CUDA_VISIBLE_DEVICES=0 python -m bin.main train_and_eval --config config/opennmt-defaults.yml config/data/toy-ende.yml --model_type NMTSmall --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2225 --task_index 0 --task_type ps

So the message indicates that the instances can't communicate with each other, either because the hostnames and ports are wrong or because of system-wide restrictions (which we can't help you with).
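
As a quick sanity check (just a sketch using the Python standard library, not part of OpenNMT-tf), once all three instances are running you can verify from each machine that every cluster endpoint accepts connections:

# Sketch: check that each cluster endpoint is reachable.
import socket

endpoints = [("localhost", 2223), ("localhost", 2224),
             ("localhost", 2225), ("localhost", 2222)]
for host, port in endpoints:
    try:
        socket.create_connection((host, port), timeout=2).close()
        print("{}:{} is reachable".format(host, port))
    except socket.error as error:
        print("{}:{} is NOT reachable: {}".format(host, port, error))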

But then how will I assign the task_index for localhost:2225?

You have to run 2 separate instances:

CUDA_VISIBLE_DEVICES=2 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:2224 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225,localhost:2222 \
    --task_type worker \
    --task_index 0

CUDA_VISIBLE_DEVICES=3 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:2224 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225,localhost:2222 \
    --task_type worker \
    --task_index 1

Note that all instances should have the same cluster settings (i.e. same --ps_hosts, --chief_host, --worker_hosts).
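
For reference, when everything is configured consistently, every instance should log a TF_CONFIG with the same cluster section and only a different task section. Here is a sketch of what that looks like for the first worker (host names taken from the commands above):

# Sketch of the TF_CONFIG every instance should share, except for "task".
import json

tf_config = {
    "cluster": {
        "ps": ["localhost:2224"],
        "chief": ["localhost:2223"],
        "worker": ["localhost:2225", "localhost:2222"],
    },
    # Only this part differs per instance, e.g. {"type": "worker", "index": 1}
    # for the second worker.
    "task": {"type": "worker", "index": 0},
}
print(json.dumps(tf_config, indent=2))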

If you don't succeed in setting up your system correctly, you can try replicated training instead:

http://opennmt.net/OpenNMT-tf/training.html#replicated-training


(Parthasarathy Subburaj) #7

Thank you for your support.