Changing the command to the one below worked for the chief (I removed the CUDA device number; I did not have to cancel the workers or the ps for this):
CUDA_VISIBLE_DEVICES= onmt-main train_and_eval --model_type Transformer ... tee run1/en_es_transformer_a_chief.log
It gives me warnings saying "Not found: Tensorflow device GPU:0 was not registered", but the checkpoints and summaries are generated fine. It is not exactly clear to me why GPU:0 causes a problem here.
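For completeness, the full set of commands looks roughly like the sketch below. I have abbreviated the data/config arguments, and the hosts, ports, and distributed flags (--ps_hosts, --chief_host, --worker_hosts, --task_type, --task_index) are placeholders based on the OpenNMT-tf 1.x distributed training interface, so adjust them to your setup:

# Chief: an empty CUDA_VISIBLE_DEVICES hides all local GPUs from this process
CUDA_VISIBLE_DEVICES= onmt-main train_and_eval --model_type Transformer --config config.yml \
  --ps_hosts host0:2222 --chief_host host1:2223 --worker_hosts host2:2224,host3:2225 \
  --task_type chief --task_index 0 2>&1 | tee run1/en_es_transformer_a_chief.log

# Worker 0: pinned to the single local GPU
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type Transformer --config config.yml \
  --ps_hosts host0:2222 --chief_host host1:2223 --worker_hosts host2:2224,host3:2225 \
  --task_type worker --task_index 0 2>&1 | tee run1/en_es_transformer_a_worker0.log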
@guillaumekln @lockder
Any idea why there are no BLEU scores reported in the log files (evaluation section) during distributed training, even though I’m using the configurations below:
Yes, I do run a command for the evaluator. Usually I use one server as both the evaluator and the ps server, since both fit inside the GPU memory.
You have to run them in the background.
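Roughly like this on the shared machine; the hosts, ports, and distributed flags are the same kind of placeholders as above, and the trailing & is what puts each process in the background:

# ps task, started in the background
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type Transformer --config config.yml \
  --ps_hosts host0:2222 --chief_host host1:2223 --worker_hosts host2:2224 \
  --task_type ps --task_index 0 > run1/ps0.log 2>&1 &

# Evaluator task on the same machine; this is the process that runs the external evaluators (e.g. BLEU)
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval --model_type Transformer --config config.yml \
  --ps_hosts host0:2222 --chief_host host1:2223 --worker_hosts host2:2224 \
  --task_type evaluator --task_index 0 > run1/evaluator.log 2>&1 &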
Right now I’m working on a tagger model, not a sequence-to-sequence model, so I will try it and let you know if the BLEU score works for me. The F1 score does work and I see it in the log.
Thanks!
If you can share all the commands (here or in a txt file) that you are using for the tagger model or the seq-to-seq model, I would greatly appreciate it. It would be helpful for cross-referencing.
Thanks @lockder
You did not specify CUDA_VISIBLE_DEVICES=0; any reason why?
Also, did you happen to test whether the BLEU scores show up? I ran the model yesterday; the predictions file was getting populated in the eval folder, but I did not see any BLEU scores in the output log or on stdout.
Because I have 8 machines with only 1 GPU each, so I do it in a distributed way over the network.
Make sure you set this up in configfile.yml:
save_eval_predictions: true
external_evaluators: BLEU
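That is, nested under the eval section of the YAML, roughly like this (batch_size is just an illustrative value):

eval:
  batch_size: 32
  save_eval_predictions: true
  external_evaluators: BLEU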
As I said, I’m not doing sequence-to-sequence; I’m doing SequenceTagger and SequenceClassification for now, and both work with the score if I run an evaluator machine with the task added.
@lockder @guillaumekln
I was able to see the BLEU scores after the evaluations. It worked perfectly fine. I was starting the evaluator differently, hence it was crashing.
A couple of queries:
The chief process seems to be stuck on model export (I don’t think the model averaging export is happening), or it simply doesn’t stop after the train_steps are completed. Presently, I manually CTRL-C to stop the processes. Is this a bug? (The workers run fine and stop after the train steps are completed.)