I’m using a p2.8xlarge instance with 8 GPUs to run distributed training (out-of-the-box Transformer model). I currently do this by running multiple screens, one per GPU. Training is significantly faster than replicated mode and everything runs fine, except that after training completes the number of steps set in the config (currently 25000 in my config file), the chief and ps processes do not terminate automatically. Model averaging is also not performed, so the avg folder is never created.
At the moment I have to terminate them manually with Ctrl+C. I did a bit of searching; it may be a distributed TensorFlow issue, as mentioned here: Issue21305 and ShutDownGRPC Server.
Let me know if any of you have a workaround for this.
I had the same issue. My solution is very dirty, but I haven’t looked for a better way to do it.
I just save each PID to a file when the process starts, and I check CPU usage every 2 seconds. If it drops below 15% for more than 30 seconds, I kill each service that is still open, then run the training again; since training is already done, it only performs the averaging. Something like the sketch below.
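To make the idea concrete, here is a minimal sketch of that watchdog in Python. It assumes the launcher writes one PID per line to a file (the file name, thresholds, and intervals below are placeholders, not part of the original setup), and it uses psutil to poll CPU usage and terminate the leftover chief/ps processes:

```python
# Watchdog sketch: kill leftover training processes once CPU stays idle.
# Assumption: the launcher appended each process PID to "train_pids.txt".
import time
import psutil

PID_FILE = "train_pids.txt"   # one PID per line, written at launch time
CPU_THRESHOLD = 15.0          # percent; below this we consider training idle
IDLE_SECONDS = 30             # how long CPU must stay low before killing
POLL_INTERVAL = 2             # seconds between checks

def read_pids(path):
    """Read the PIDs saved by the launcher, ignoring blank lines."""
    with open(path) as f:
        return [int(line) for line in f if line.strip()]

def main():
    pids = read_pids(PID_FILE)
    idle_since = None

    while True:
        # Machine-wide CPU usage, averaged over the poll interval (blocks 2s).
        cpu = psutil.cpu_percent(interval=POLL_INTERVAL)

        if cpu < CPU_THRESHOLD:
            idle_since = idle_since or time.time()
            if time.time() - idle_since >= IDLE_SECONDS:
                # Training looks finished: terminate whatever chief/ps/worker
                # processes are still alive, then exit the watchdog.
                for pid in pids:
                    if psutil.pid_exists(pid):
                        psutil.Process(pid).terminate()
                break
        else:
            idle_since = None

if __name__ == "__main__":
    main()
```

Once the watchdog has killed the stuck processes, you relaunch the same training command; since the configured step count is already reached, it finishes immediately and produces the avg folder.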
It’s not clean, but it works. If you find something better, please tell me.