I followed the instructions in this issue to perform distributed training.
I know I have to use the “evaluator” task to run evaluation, but I want training to pause during evaluation, which is what happens when I am not using distributed training.
Should I allocate a new GPU for evaluation, or just run the command with “--task_type evaluator” on the “chief” machine?
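To be clear, the second option I have in mind is just a sketch like this (“train.py” stands in for the actual training script from this issue; “--task_type evaluator” is the flag mentioned above):

```python
# Sketch of option 2: a second process on the "chief" machine that only evaluates.
# "train.py" is a placeholder for the real training script in this issue.
import subprocess

# Not sure whether this process needs its own GPU, can share one, or should
# run on CPU only; that is exactly my question.
subprocess.Popen(["python", "train.py", "--task_type", "evaluator"])
```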
What I do is run the parameter server (ps) task and the evaluator task on the same machine, as two separate processes on different ports, because the ps task usually uses less GPU memory.
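Roughly like this (just a sketch, not the exact commands I run; the host names, ports, and “train.py” are placeholders, and I am assuming the standard TF_CONFIG convention, so adapt it to however the script in this issue actually passes its cluster config):

```python
# Two processes on the same physical machine ("shared-host"): one ps, one evaluator.
# Host names, ports, and "train.py" are placeholders.
import json
import os
import subprocess

cluster = {
    "chief":  ["chief-host:2222"],
    "worker": ["worker-host:2222"],
    "ps":     ["shared-host:2222"],  # the ps server binds its own port on the shared machine
}

def launch(task_type, index):
    # The evaluator is normally not listed in the cluster spec; it is identified
    # only by its task type, so the two processes on the shared machine differ
    # only in the "task" entry (and in the port the ps server listens on).
    tf_config = {"cluster": cluster, "task": {"type": task_type, "index": index}}
    env = dict(os.environ, TF_CONFIG=json.dumps(tf_config))
    return subprocess.Popen(["python", "train.py", "--task_type", task_type], env=env)

launch("ps", 0)         # light GPU memory usage
launch("evaluator", 0)  # separate process on the same machine
```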