I’m using a p2.8xlarge instance with 8 GPUs to run distributed training of an out-of-the-box Transformer model. At the moment I do this by running multiple screen sessions, one per GPU. Training is significantly faster than in replicated mode and everything runs fine, except that after training completes the number of steps set in the config (currently 25000 in my config file), the ps processes do not terminate automatically. Model averaging also never runs, so the avg folder is not created.
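As a stopgap for the first issue, and assuming these are standard TensorFlow parameter-server jobs (ps tasks block in server.join() forever, so they never exit on their own), I kill them by hand once the workers finish. Here’s a rough watchdog sketch; the job_name match strings are assumptions, so adjust them to whatever your actual launch commands look like in ps aux:

```python
#!/usr/bin/env python3
"""Kill leftover ps processes once all worker processes have exited."""
import subprocess
import time

# Assumed match strings -- check `ps aux` for how your jobs were launched.
WORKER_PATTERN = "job_name=worker"
PS_PATTERN = "job_name=ps"

def pids_matching(pattern):
    # pgrep -f matches against the full command line; no matches -> empty list.
    result = subprocess.run(["pgrep", "-f", pattern],
                            stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout.split()

# Poll until every worker process is gone.
while pids_matching(WORKER_PATTERN):
    time.sleep(60)

# The ps tasks never return from server.join(), so terminate them explicitly.
subprocess.run(["pkill", "-f", PS_PATTERN])
print("Workers finished; sent SIGTERM to ps processes.")
```

Not elegant, but it frees up the instance for the next run.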
Let me know if any of you have a cleaner workaround for either issue.
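In the meantime, for the missing avg folder, I average the trailing checkpoints myself with a small TF 1.x script. A minimal sketch; MODEL_DIR, OUTPUT_PATH, and NUM_LAST are placeholders to fill in, and if you’re on tensor2tensor its utils/avg_checkpoints.py does essentially the same thing:

```python
import tensorflow as tf

MODEL_DIR = "/path/to/train_dir"    # placeholder: where the checkpoints were written
OUTPUT_PATH = "/path/to/avg/model"  # placeholder: prefix for the averaged checkpoint
NUM_LAST = 5                        # placeholder: how many trailing checkpoints to average

# Pick the last NUM_LAST checkpoints recorded in MODEL_DIR's `checkpoint` file.
state = tf.train.get_checkpoint_state(MODEL_DIR)
checkpoints = state.all_model_checkpoint_paths[-NUM_LAST:]

# Sum every variable across the selected checkpoints (skip the step counter).
var_names = [name for name, _ in tf.train.list_variables(checkpoints[0])
             if not name.startswith("global_step")]
sums = {}
for ckpt in checkpoints:
    for name in var_names:
        value = tf.train.load_variable(ckpt, name)
        sums[name] = value if name not in sums else sums[name] + value

# Materialize the averages as variables and write out a normal checkpoint.
with tf.Graph().as_default(), tf.Session() as sess:
    avg_vars = [tf.get_variable(name, initializer=sums[name] / len(checkpoints))
                for name in var_names]
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(avg_vars).save(sess, OUTPUT_PATH)
```

That gives me a usable averaged checkpoint, but I’d still prefer the built-in averaging to kick in on its own.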