Issue using distributed training in OpenNMT-tf

I am facing some issues when trying to use the distributed training framework in OpenNMT-tf. This may be a basic question, but I couldn't resolve it myself nor find an answer on GitHub or Stack Overflow.

I am running my code on a Linux system that has 4 GPU cards. Initially I set CUDA_VISIBLE_DEVICES with:

export CUDA_VISIBLE_DEVICES=0,1,2,3

Then I ran the following commands in my terminal (each command was run in a different screen session):

Screen 1:
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type NMTSmall \
--config config/my_config.yml \
--ps_hosts localhost:12240 \
--chief_host localhost:2223 \
--worker_hosts localhost:2225 \
--task_type ps \
--task_index 0

Screen 2:
CUDA_VISIBLE_DEVICES=1 onmt-main train_and_eval --model_type NMTSmall \
--config config/my_config.yml \
--ps_hosts localhost:2224 \
--chief_host localhost:2223 \
--worker_hosts localhost:2225 \
--task_type chief \
--task_index 0

Screen 3:
CUDA_VISIBLE_DEVICES=2,3 onmt-main train_and_eval --model_type NMTSmall \
--config config/my_config.yml \
--ps_hosts localhost:2224 \
--chief_host localhost:2223 \
--worker_hosts localhost:2225,localhost:2222 \
--task_type worker \
--task_index 0

When I ran these commands, I got the following messages in screen 2 and screen 3, where my chief and worker are running:

INFO:tensorflow:TF_CONFIG environment variable: {u'cluster': {u'ps': [u'localhost:2224'], u'chief': [u'localhost:2223'], u'worker': [u'localhost:2225']}, u'task': {u'index': 0, u'type': u'chief'}}
2018-07-24 15:12:34.621963: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-07-24 15:12:34.891773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.7335
pciBusID: 0000:06:00.0
totalMemory: 7.92GiB freeMemory: 7.80GiB
2018-07-24 15:12:34.891837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
2018-07-24 15:12:34.930176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 3, '_task_type': u'chief', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f0a74a65910>, '_save_checkpoints_steps': 50, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 1, '_tf_random_seed': None, '_master': u'grpc://localhost:2223', '_num_worker_replicas': 2, '_task_id': 0, '_log_step_count_steps': 50, '_model_dir': 'Spanish-English', '_save_summary_steps': 50}
INFO:tensorflow:Start Tensorflow server.
2018-07-24 15:12:34.933360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:06:00.0, compute capability: 6.1)
E0724 15:12:34.933676489 15565 ev_epoll1_linux.c:1051] grpc epoll fd: 24
2018-07-24 15:12:34.938235: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job chief -> {0 -> localhost:2223}
2018-07-24 15:12:34.938269: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2224}
2018-07-24 15:12:34.938279: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2225}
2018-07-24 15:12:34.939204: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2223
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Number of trainable parameters: 72082255
2018-07-24 15:12:47.301593: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-07-24 15:12:47.301698: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/replica:0/task:0
2018-07-24 15:12:57.301841: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2018-07-24 15:12:57.301914: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:worker/rep

Based on my understanding, the chief and the workers are not able to communicate with each other. I don't know what is going wrong. Any help would be highly appreciated.

What is your TensorFlow version?

There are some typos in your command lines that could cause issues:

You should set --ps_hosts localhost:2224.

Looks like the worker at localhost:2222 was not started?

Can you address those points and try again?

Thanks for your response. Currently I am using TensorFlow 1.4.1.

I changed --ps_hosts localhost:12240 to --ps_hosts localhost:2224 and ran my code, but it's still showing the same error as before.

Also, how do I start the worker at localhost:2222? Won't executing the above command in screen 3 do it?

Command 3 is starting worker 0 (--task_index 0), which is localhost:2225.

Thank you.

But then how will I assign the task_index for localhost:2225?
When I tried running:

CUDA_VISIBLE_DEVICES=2,3 onmt-main train_and_eval --model_type NMTSmall \
--config config/my_config.yml \
--ps_hosts localhost:2224 \
--chief_host localhost:2223 \
--worker_hosts localhost:2225,localhost:2222 \
--task_type worker \
--task_index 0,1

I got the following error:

onmt-main: error: argument --task_index: invalid int value: '0,1'

I also tried using a single worker with the command:
CUDA_VISIBLE_DEVICES=2 onmt-main train_and_eval --model_type NMTSmall \
--config config/my_config.yml \
--ps_hosts localhost:2224 \
--chief_host localhost:2223 \
--worker_hosts localhost:2225 \
--task_type worker \
--task_index 0

Even this gave me the same error:
2018-07-24 15:12:57.301841: I tensorflow/core/distributed_runtime/master.cc:221] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

I checked that the following commands work on my test server:

CUDA_VISIBLE_DEVICES= python -m bin.main train_and_eval --config config/opennmt-defaults.yml config/data/toy-ende.yml --model_type NMTSmall --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2225 --task_index 0 --task_type worker

CUDA_VISIBLE_DEVICES=0 python -m bin.main train_and_eval --config config/opennmt-defaults.yml config/data/toy-ende.yml --model_type NMTSmall --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2225 --task_index 0 --task_type chief

CUDA_VISIBLE_DEVICES=0 python -m bin.main train_and_eval --config config/opennmt-defaults.yml config/data/toy-ende.yml --model_type NMTSmall --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2225 --task_index 0 --task_type ps

So the message indicates that the instances can't communicate with each other, either because the hostname or port is wrong or because of system-wide restrictions (we can't help you with that).

But then how will I assign the task_index for localhost:2225?

You have to run 2 separate instances:

CUDA_VISIBLE_DEVICES=2 onmt-main train_and_eval --model_type NMTSmall \
--config config/my_config.yml \
--ps_hosts localhost:2224 \
--chief_host localhost:2223 \
--worker_hosts localhost:2225,localhost:2222 \
--task_type worker \
--task_index 0

CUDA_VISIBLE_DEVICES=3 onmt-main train_and_eval --model_type NMTSmall \
--config config/my_config.yml \
--ps_hosts localhost:2224 \
--chief_host localhost:2223 \
--worker_hosts localhost:2225,localhost:2222 \
--task_type worker \
--task_index 1

Note that all instances should have the same cluster settings (i.e. same --ps_hosts, --chief_host, --worker_hosts).

If you don't succeed in setting up your system correctly, you can try replicated training instead:

http://opennmt.net/OpenNMT-tf/training.html#replicated-training

Thank you for your support.

Hi @guillaumekln

Is there an order in which these commands should be executed? For example, start 'chief' first, then 'ps', and then 'worker'?

Also, for the task type 'worker' you did not assign any CUDA devices; is there any reason why?

Mohammed Ayub

I have one g2.8xlarge instance with 8 GPUs on the same machine. How can I optimally distribute the training on this machine?

Appreciate any help.
Thanks!

Mohammed Ayub

For a single machine, simply use replicated training unless you have a good reason not to. Combined with the Transformer model you should get pretty good performance:

onmt-main [...] --model_type Transformer --config data.yml --auto_config --num_gpus 8
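In case it helps, the data.yml passed with --config here is just the usual OpenNMT-tf data configuration. A minimal sketch for a word-level model in OpenNMT-tf 1.x could look like the following (all file paths are placeholders for your own files):

# Minimal data configuration sketch; replace the paths with your own files.
model_dir: run/

data:
  train_features_file: data/src-train.txt
  train_labels_file: data/tgt-train.txt
  eval_features_file: data/src-val.txt
  eval_labels_file: data/tgt-val.txt
  source_words_vocabulary: data/src-vocab.txt
  target_words_vocabulary: data/tgt-vocab.txt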

I don't think so. They should wait for each other, but I have not experimented with distributed setups much.

No specific reason. It's true that a worker should run on a GPU while the ps can stay on the CPU.

For distributed training I usually start with the chief; for the other machines I don't follow any special order. Note that I'm currently using 8 separate machines, not a single machine.

I found training slower with replicated training than with distributed training.

This is expected as distributed updates are asynchronous. However, replicated training gives you the benefit of a larger batch size, which is strongly recommended for Transformer models.


Thanks @guillaumekln @lockder

For my run I'm using the default train batch_size: 3072 and eval batch_size: 32. Is this ideal for the Transformer model?

I also saw a big drop in training time when using distributed training vs. replicated training. If I understand correctly, even though we get faster training with distributed training, we are not losing accuracy, right? Did you see any drop in model performance?

Mohammed Ayub

Also, do continued training and fine-tuning work seamlessly with distributed training, as they do with replicated training? I will be trying this out; I just wanted to get your thoughts in case you have already tried it and ran into any roadblocks.

Mohammed Ayub

If you are using replicated training, the effective batch size will be 3072 * N, where N is the number of GPUs. 3072 is tuned for 8GB of GPU memory (actually a bit less, to be safe), so if you have more memory you can also increase this value. For reference, the original paper uses a total batch size of 25K.
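If you want to raise it, batch_size is set in the train section of the YAML configuration. A minimal sketch, assuming the standard OpenNMT-tf 1.x train block (pick a value that fits your GPU memory):

train:
  batch_size: 4096      # tokens per replica; only raise this if your GPUs have more than 8GB
  batch_type: tokens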

Regarding the Transformer, you will lose model performance when using distributed training because it does not increase the effective batch size, unless you use the gradient accumulation trick from the latest release.

I think it should be identical.

Thank you for those great insights.

How do I use this trick? Are any changes required?

Mohammed Ayub

If you use the Transformer model and the --auto_config flag, it is enabled by default to simulate N = 8 (see my previous post). You can also set the value manually; see gradients_accum in the configuration reference:

http://opennmt.net/OpenNMT-tf/configuration_reference.html
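For example, to set it manually, something along these lines should work; this is only a sketch that assumes gradients_accum goes in the params section, so double-check against the configuration reference above:

params:
  gradients_accum: 8    # accumulate gradients over 8 training steps to simulate 8 replicas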

Yes, I'm using --auto_config, so that solves it, I guess. Thanks!

Mohammed Ayub

@guillaumekln @lockder

Update: the chief keeps crashing during distributed training, while the ps and workers continue to run without any error.
Below are the details:

Error:
tensorflow.python.framework.errors_impl.InternalError: could not parse rpc response

[[Node: transformer/decoder/layer_0/ffn/conv1d/kernel_S2659 = _Recv[_start_time=0, client_terminated=false, recv_device="/job:chief/replica:0/task:0/device:GPU:0", send_device="/job:ps/replica:0/task:0/device:GPU:0", send_device_incarnation=142405553525176343, tensor_name="edge_115_transformer/decoder/layer_0/ffn/conv1d/kernel", tensor_type=DT_FLOAT, _device="/job:chief/replica:0/task:0/device:GPU:0"]()]]

[[Node:optim/gradients/transformer/parallel_0/transformer/decoder_1/layer_0/masked_multi_head/conv1d/BiasAdd_grad/BiasAddGrad_S3579 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:chief/replica:0/task:0/device:GPU:0", send_device_incarnation=-1748274500797833712, tensor_name="edge_11464...iasAddGrad", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]

Commands I’m running:

CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 0 --task_type ps 2>&1 | tee run1/en_es_transformer_a_ps.log
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 0 --task_type chief 2>&1 | tee run1/en_es_transformer_a_chief.log
CUDA_VISIBLE_DEVICES=1 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 0 --task_type worker 2>&1 | tee run1/en_es_transformer_a_worker0.log
CUDA_VISIBLE_DEVICES=2 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 1 --task_type worker 2>&1 | tee run1/en_es_transformer_a_worker1.log
CUDA_VISIBLE_DEVICES=3 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 2 --task_type worker 2>&1 | tee run1/en_es_transformer_a_worker2.log
CUDA_VISIBLE_DEVICES=4 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 3 --task_type worker 2>&1 | tee run1/en_es_transformer_a_worker3.log
CUDA_VISIBLE_DEVICES=5 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 4 --task_type worker 2>&1 | tee run1/en_es_transformer_a_worker4.log
CUDA_VISIBLE_DEVICES=6 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 5 --task_type worker 2>&1 | tee run1/en_es_transformer_a_worker5.log
CUDA_VISIBLE_DEVICES=7 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 6 --task_type worker 2>&1 | tee run1/en_es_transformer_a_worker6.log

Not sure what I'm doing wrong.

Any help appreciated.

Mohammed Ayub