I am facing some issues when trying to use the distributed training framework in OpenNMT-tf. This may be a basic question, but I couldn't resolve it myself, nor could I find an answer on GitHub or Stack Overflow.
I am running my code on a Linux system with 4 GPU cards. Initially I set CUDA_VISIBLE_DEVICES with:
export CUDA_VISIBLE_DEVICES=0,1,2,3
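As a sanity check (this is just my own snippet, not part of OpenNMT-tf), I confirm which devices TensorFlow sees from a Python session started in the same shell:

# List the devices visible to this process; with CUDA_VISIBLE_DEVICES=0,1,2,3
# I would expect /device:GPU:0 .. /device:GPU:3 alongside the CPU device.
from tensorflow.python.client import device_lib
print([d.name for d in device_lib.list_local_devices()])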
Then I ran the following commands in my terminal (each one in a separate screen session):
Screen 1:
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:12240 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225 \
    --task_type ps \
    --task_index 0
Screen 2:
CUDA_VISIBLE_DEVICES=1 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:2224 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225 \
    --task_type chief \
    --task_index 0
Screen 3:
CUDA_VISIBLE_DEVICES=2,3 onmt-main train_and_eval --model_type NMTSmall \
    --config config/my_config.yml \
    --ps_hosts localhost:2224 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2225,localhost:2222 \
    --task_type worker \
    --task_index 0
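For reference, my understanding (a sketch of what I think the flags translate to, not taken from the OpenNMT-tf source) is that each process builds a TF_CONFIG along these lines, which matches what is printed in the chief's log below:

import json
import os

# Cluster specification as I understand it from the flags above; I assume all
# three processes are supposed to share the same host/port lists.
cluster = {
    "ps": ["localhost:2224"],
    "chief": ["localhost:2223"],
    "worker": ["localhost:2225"],
}

# For the chief process (screen 2); the ps and worker processes would use
# their own task type and index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "chief", "index": 0},
})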
When I run these commands, in screen 2 and screen 3 (where the chief and worker are running) I get the following output:
INFO:tensorflow:TF_CONFIG environment variable: {u'cluster': {u'ps': [u'localhost:2224'], u'chief': [u'localhost:2223'], u'worker': [u'localhost:2225']}, u'task': {u'index': 0, u'type': u'chief'}}
2018-07-24 15:12:34.621963: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-24 15:12:34.891773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:06:00.0
totalMemory: 7.92GiB freeMemory: 7.80GiB
2018-07-24 15:12:34.891837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
2018-07-24 15:12:34.930176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 3, '_task_type': u'chief', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0a74a65910>, '_save_checkpoints_steps': 50, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 1, '_tf_random_seed': None, '_master': u'grpc://localhost:2223', '_num_worker_replicas': 2, '_task_id': 0, '_log_step_count_steps': 50, '_model_dir': 'Spanish-English', '_save_summary_steps': 50}
INFO:tensorflow:Start Tensorflow server.
2018-07-24 15:12:34.933360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
E0724 15:12:34.933676489 15565 ev_epoll1_linux.c:1051] grpc epoll fd: 24
2018-07-24 15:12:34.938235: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 15:12:34.938269: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2224}
2018-07-24 15:12:34.938279: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2225}
2018-07-24 15:12:34.939204: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2223
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Number of trainable parameters: 72082255
2018-07-24 15:12:47.301593: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-07-24 15:12:47.301698: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-07-24 15:12:57.301841: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-07-24 15:12:57.301914: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/rep
Based on my understanding, it looks like the chief and the workers are not able to communicate with each other, but I don't know what is going wrong. Any help would be highly appreciated.
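In case it is relevant, here is a minimal sketch (my own check, with the ports assumed from the commands above, not something OpenNMT-tf provides) that I can run to see whether the ps/chief/worker gRPC ports are reachable at all:

import socket

# Ports taken from the ps_hosts/chief_host/worker_hosts flags above.
hosts = {
    "ps": ("localhost", 2224),
    "chief": ("localhost", 2223),
    "worker": ("localhost", 2225),
}

for name, (host, port) in hosts.items():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(2)
    try:
        s.connect((host, port))
        print("%s %s:%d reachable" % (name, host, port))
    except socket.error as e:
        print("%s %s:%d NOT reachable: %s" % (name, host, port, e))
    finally:
        s.close()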