OpenNMT-tf Distributed Training - Processes don't end after training completes


I’m using a p2.8xlarge instance with 8 GPUs to run distributed training (out-of-the-box Transformer model). I currently do this by running multiple screen sessions, each on a different GPU. Training is significantly faster than in replicated mode and everything runs fine, except that once training reaches the number of steps set in the config (currently 25000 in my config file), the chief and ps processes do not terminate automatically. Also, model averaging does not happen, so the avg folder is not created.

Presently I have to terminate them manually using Ctrl+C. I did a little bit of searching; maybe it’s a distributed TensorFlow issue, as mentioned here: Issue21305 and ShutDownGRPC Server

Let me know if any of you has a workaround for this.

Thanks!

Mohammed Ayub

I had the same issue. My solution is very dirty, but I haven’t looked for a better way to do it.
I save each PID to a file when the process starts and check the CPU usage every 2 seconds. If it drops below 15% for more than 30 seconds, I kill every service that is still running. Then I run the training again, and since it is already done, it just does the averaging.
It’s not clean, but it works. If you find something much better, please tell me 🙂
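For anyone who wants a starting point, the watchdog described above could be sketched roughly like this. This is not the actual script from this reply, just a minimal illustration under a few assumptions: the function names (psutil_cpu_percent, wait_until_idle, terminate_all) are made up for the example, "idle" is taken to mean that every watched process is below the threshold, and psutil (a third-party package) is used for the CPU readings and the cleanup.

```python
import time
from typing import Callable, Iterable, List


def psutil_cpu_percent(pid: int) -> float:
    """CPU usage of one process in percent, sampled over 0.1 s."""
    import psutil  # third-party dependency: pip install psutil

    try:
        return psutil.Process(pid).cpu_percent(interval=0.1)
    except psutil.NoSuchProcess:
        # A process that already exited counts as idle.
        return 0.0


def wait_until_idle(
    pids: Iterable[int],
    cpu_percent: Callable[[int], float] = psutil_cpu_percent,
    threshold: float = 15.0,     # percent, as in the post above
    idle_seconds: float = 30.0,  # how long usage must stay low
    poll_interval: float = 2.0,  # seconds between checks
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until all pids stay below `threshold` for `idle_seconds`.

    Returns the number of polls performed. The `cpu_percent` and `sleep`
    hooks are injectable so the loop can be exercised without real processes.
    """
    pids = list(pids)
    polls = 0
    idle_for = 0.0
    while idle_for < idle_seconds:
        sleep(poll_interval)
        polls += 1
        busy = any(cpu_percent(pid) >= threshold for pid in pids)
        # Any burst of activity resets the idle timer.
        idle_for = 0.0 if busy else idle_for + poll_interval
    return polls


def terminate_all(pids: Iterable[int], grace_seconds: float = 10.0) -> List[int]:
    """Send SIGTERM to every pid still alive, then SIGKILL the stragglers.

    Returns the pids that had to be force-killed.
    """
    import psutil  # third-party dependency: pip install psutil

    procs = []
    for pid in pids:
        try:
            proc = psutil.Process(pid)
            proc.terminate()
            procs.append(proc)
        except psutil.NoSuchProcess:
            continue  # already gone
    _, alive = psutil.wait_procs(procs, timeout=grace_seconds)
    for proc in alive:
        proc.kill()
    return [proc.pid for proc in alive]
```

With the PIDs of the chief, ps and worker processes read back from the file, the idea would be to call wait_until_idle(pids), then terminate_all(pids), and then launch the training command once more so that it finds the finished checkpoint and only does the averaging. The 15% / 30 s / 2 s values are the ones from the post and probably need tuning for other setups.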

Thanks @lockder for that suggestion. I might try to do the same.
To save some time, do you have a script built for it? If yes, would you mind sharing it?

Mohammed Ayub

It’s integrated into several pieces of code inside a Python file, sorry.

I use multiprocessing and psutil from Python to do it.

No. That’s fine. I’ll try doing the same. Appreciate your help.