I’m using a p2.8xlarge instance with 8 GPUs to run distributed training (the out-of-the-box Transformer model). At the moment I do that by running multiple
screen sessions on different GPUs. Training is significantly faster than in replicated mode and everything runs fine, except that once training completes the ‘x’ steps set in the config (currently 25000 in my config file), the
ps processes do not terminate automatically. Also, model averaging does not run, so the
avg folder is not created.
For now I have to terminate them manually with Ctrl+C. I did a bit of searching, and it may be a distributed TensorFlow issue, as mentioned here: Issue21305 and ShutDownGRPC Server
Let me know if any of you have a workaround for this.
I had the same issue. My solution is very dirty, but I haven’t looked for a better way to do it.
I save each PID to a file when the process starts and check the CPU usage every 2 seconds. If it drops below 15% for more than 30 s, I kill every service that is still open. Then I run the training again, and since training is already done, it just does the averaging.
It’s not clean, but it works. If you find something much better, please tell me.
Thanks @lockder for that suggestion. I might try to do the same.
To save some time, do you have a script built for it? If so, would you mind sharing it?
It’s integrated into several pieces of code inside a Python file, sorry.
I use multiprocessing and psutil from Python to do it.
No. That’s fine. I’ll try doing the same. Appreciate your help.
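For anyone landing here later, a minimal sketch of the watchdog described above might look like the following. It is not @lockder's actual script. It assumes the third-party psutil package is installed, that the PIDs were written one per line to a file (the name pids.txt in the usage note is hypothetical), that "CPU use" means system-wide CPU usage, and that "goes down 15%" means "stays below 15%". The function names are made up for illustration.

```python
import time


def update_idle_since(idle_since, now, cpu_percent, threshold=15.0):
    """Track when the current low-CPU streak started.

    Returns None while CPU is at or above the threshold, otherwise the
    timestamp at which the current streak of low readings began.
    """
    if cpu_percent >= threshold:
        return None
    return now if idle_since is None else idle_since


def watch_and_kill(pid_file, interval=2.0, threshold=15.0, idle_seconds=30.0):
    """Wait until CPU stays below `threshold` for `idle_seconds`, then stop
    every process listed in `pid_file`. Returns the PIDs that were signalled.
    """
    import psutil  # third-party; imported here so the helper above has no dependency

    with open(pid_file) as f:
        pids = [int(line) for line in f if line.strip()]

    idle_since = None
    while True:
        # Assumption: system-wide CPU usage; this call blocks for `interval` seconds.
        cpu = psutil.cpu_percent(interval=interval)
        now = time.monotonic()
        idle_since = update_idle_since(idle_since, now, cpu, threshold)
        if idle_since is not None and now - idle_since >= idle_seconds:
            break

    # Collect the processes that are still around.
    procs = []
    for pid in pids:
        try:
            procs.append(psutil.Process(pid))
        except psutil.NoSuchProcess:
            pass  # already exited on its own

    # Ask politely first, then force-kill whatever is left after 10 seconds.
    for proc in procs:
        try:
            proc.terminate()
        except psutil.NoSuchProcess:
            pass
    _, alive = psutil.wait_procs(procs, timeout=10)
    for proc in alive:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass
    return [proc.pid for proc in procs]
```

Calling watch_and_kill("pids.txt") blocks until the machine has been idle for 30 seconds and then stops the leftover processes. After that you would rerun the training command yourself so that, as described above, it sees training is already complete and only does the averaging.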