Issue using distributed training in OpenNMT-tf

Changing the command to the one below worked for the chief (I removed the CUDA device number; I did not have to cancel the workers or ps for this):

CUDA_VISIBLE_DEVICES= onmt-main train_and_eval --model_type Transformer ... 2>&1 | tee run1/en_es_transformer_a_chief.log

It gives me warnings saying "Not found: Tensorflow device GPU:0 was not registered", but the checkpoints and summaries are generated fine. It's not exactly clear why GPU:0 causes a problem here.
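
For context, a minimal sketch of this layout on one machine, reusing the ports from the commands elsewhere in this thread (the worker line is illustrative, not the exact command I ran):

# Chief on CPU only: an empty CUDA_VISIBLE_DEVICES hides all GPUs from TensorFlow.
CUDA_VISIBLE_DEVICES= onmt-main train_and_eval --model_type Transformer \
    --config run1/config_run.yml --auto_config \
    --chief_host localhost:2223 --ps_hosts localhost:2224 --worker_hosts localhost:2222 \
    --task_type chief --task_index 0 2>&1 | tee run1/en_es_transformer_a_chief.log

# Worker pinned to the first GPU:
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type Transformer \
    --config run1/config_run.yml --auto_config \
    --chief_host localhost:2223 --ps_hosts localhost:2224 --worker_hosts localhost:2222 \
    --task_type worker --task_index 0 2>&1 | tee run1/en_es_transformer_a_worker0.log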

Mohammed Ayub

@guillaumekln @lockder
Any idea why no BLEU scores are reported in the log files (evaluation section) during distributed training, even though I'm using the configuration below:

eval:
  eval_delay: 1800
  external_evaluators: [BLEU, BLEU-detok]

Thanks !

Mohammed Ayub

Should I run a separate command for evaluation (something like the one below, on another GPU instance):

CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type Transformer --config run1/config_run.yml --auto_config --ps_hosts localhost:2224 --chief_host localhost:2223 --worker_hosts localhost:2222,localhost:2225,localhost:2226,localhost:2227,localhost:2228,localhost:2229,localhost:2230 --task_index 0 --task_type evaluator 2>&1 | tee run1/en_es_transformer_a_evaluation.log

as mentioned in the TensorFlow train_and_evaluate doc. If yes, which host does this process use (chief, ps, or worker)?

Mohammed Ayub

Let me know if I should open up a separate issue for this error.

Mohammed Ayub

Not sure how the evaluation works here. @lockder, any experience with that?

Yes, I do run a command for the evaluator. Usually I use one server as both the evaluator and the ps server, since both fit inside the GPU memory.
You have to run them as background processes.

Right now I'm doing a tagger model; I'm not using a sequence-to-sequence model at the moment. I will try it and tell you if the BLEU score works for me. The F1 score does work and I see it in the log.

Thanks !
If you can share all the commands (here or in a txt file) that you are using for the tagger model or seq-to-seq model, I would greatly appreciate it. It would be helpful for cross-referencing.

Mohammed Ayub

onmt-main train_and_eval --model modelPath --config configPath --run_dir runDirPath --data_dir dataPath --gpu_allow_growth --task_type chief --task_index 0 --seed 2 --chief_host ip:port --ps_hosts ip:port,ip:port --worker_hosts ip:port,ip:port

and for the evaluator, it's not defined in the cluster spec, but I send --task_type evaluator and --task_index 0

I start with the chief, then I send several ssh commands to the other machines, starting with the workers, then the ps, and the evaluator last.
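
Roughly, the launch sequence looks like this; the hostnames, ports, and paths below are placeholders, not my actual cluster:

COMMON="onmt-main train_and_eval --model modelPath --config configPath \
--gpu_allow_growth --chief_host 10.0.0.1:2223 --ps_hosts 10.0.0.2:2224 \
--worker_hosts 10.0.0.3:2222,10.0.0.4:2222"

# 1. Chief first, on the local machine, in the background:
$COMMON --task_type chief --task_index 0 > chief.log 2>&1 &

# 2. Then one ssh command per worker machine:
ssh 10.0.0.3 "$COMMON --task_type worker --task_index 0" > worker0.log 2>&1 &
ssh 10.0.0.4 "$COMMON --task_type worker --task_index 1" > worker1.log 2>&1 &

# 3. Then the ps; it can share a machine with the evaluator since both fit in GPU memory:
ssh 10.0.0.2 "$COMMON --task_type ps --task_index 0" > ps.log 2>&1 &

# 4. The evaluator last; it is not listed in the cluster spec, only the task flags select it:
ssh 10.0.0.2 "$COMMON --task_type evaluator --task_index 0" > evaluator.log 2>&1 &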

Thanks @lockder
You did not specify CUDA_VISIBLE_DEVICES=0; any reason why?

Also, did you happen to test whether the BLEU scores show up? I ran the model yesterday; the predictions file was getting populated in the eval folder, but I did not see any BLEU scores in the output log or on standard output.

Mohammed Ayub

Because I have 8 machines with only 1 GPU each, so I do it in a distributed way over the network :slight_smile:

Make sure you set this up in configfile.yml:
save_eval_predictions: true
external_evaluators: BLEU
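
Combined with the eval settings quoted earlier in this thread, the full section would look something like this (the eval_delay value is the one from the earlier post; external_evaluators takes a single name or a list, as both forms above show):

eval:
  eval_delay: 1800                       # minimum seconds between evaluation runs
  save_eval_predictions: true            # writes prediction files into the eval folder
  external_evaluators: [BLEU, BLEU-detok]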

As I said, I'm not doing sequence to sequence; I'm doing SequenceTagger and SequenceClassification for now, and both report the score if I run an evaluator machine with the task added.

@lockder Thanks for that. I do have the config file set up correctly.
Let me know if you happen to try sequence to sequence in the future.

Cheers !
Mohammed Ayub

@lockder @guillaumekln
I was able to see the BLEU scores after the evaluations. It worked perfectly fine. I was starting the evaluator differently, hence it was crashing.

A couple of queries:

  1. The chief process seems to be stuck on the model export (I don't think the model-averaging export is happening), or at least it doesn't stop after the train_steps are completed. Presently I manually Ctrl-C to stop the processes. Is this a bug? (The workers run fine and stop after the train steps are completed.)

  2. The step numbers come out differently, i.e. I run for 25000 steps and get a 25010 checkpoint (see the checkpoint listing below). Is this because the workers update asynchronously?
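
For reference, a quick way to check which step the latest checkpoint is at (this is standard TensorFlow checkpoint bookkeeping; run1 as the model directory is an assumption):

cat run1/checkpoint
# model_checkpoint_path: "model.ckpt-25010"
# all_model_checkpoint_paths: ...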

Thanks !

Mohammed Ayub