Changing the command to the one below worked for
the chief (I removed the CUDA device number; I did not have to cancel the workers or the ps tasks for this):
CUDA_VISIBLE_DEVICES= onmt-main train_and_eval --model_type Transformer ... | tee run1/en_es_transformer_a_chief.log
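The empty assignment works because the CUDA runtime treats CUDA_VISIBLE_DEVICES as a comma-separated list of device ids: an empty string means no GPU is visible, which forces the chief onto the CPU, while leaving the variable unset exposes all GPUs. A small illustrative sketch of that parsing behavior (this is not code from CUDA or OpenNMT-tf, just a model of the semantics):

```python
import os


def visible_gpu_ids(env_value):
    """Model how CUDA interprets a CUDA_VISIBLE_DEVICES value.

    None (unset)  -> all GPUs visible (returns None as a sentinel)
    "" (empty)    -> no GPUs visible, as in the chief command above
    "0,1"         -> only the listed device ids are visible
    """
    if env_value is None:
        return None
    value = env_value.strip()
    if not value:
        return []
    return [v.strip() for v in value.split(",") if v.strip()]


# Example: the chief's environment hides every GPU.
print(visible_gpu_ids(""))            # no devices
print(visible_gpu_ids("1"))           # a worker pinned to GPU 1
print(visible_gpu_ids(os.environ.get("CUDA_VISIBLE_DEVICES")))
```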
It gives me warnings saying
"Not found: TensorFlow device GPU:0 was not registered", but the checkpoints and summaries are generated fine. It is not entirely clear why GPU:0 causes a problem here.