How to evaluate when using distributed training

I followed the instructions in this issue to perform distributed training.
I know I have to use the "evaluator" task to evaluate, but I want training to pause during evaluation, which is what happens when not using distributed training.

Should I allocate a new GPU for evaluation, or just run the command with `--task_type evaluator` on the "chief" machine?

I don’t think that’s possible. What would be the benefit?

Thank you guillaumekln.

I let the "chief" machine perform evaluation as well, and it works fine.
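For reference, this setup amounts to launching two processes on the chief machine: one for training and one for evaluation. A minimal sketch is below; the hosts, ports, and flag names (`--ps_hosts`, `--chief_host`, `--worker_hosts`, `--task_type`, `--task_index`) are assumptions based on OpenNMT-tf 1.x distributed training and should be adapted to your cluster:

```shell
# Hypothetical cluster layout; replace hosts/ports with your own.
HOSTS_ARGS="--ps_hosts host1:2222 --chief_host host0:2223 --worker_hosts host2:2224"

# Chief training process, started in the background.
onmt-main train_and_eval --model_type Transformer --config config.yml \
  $HOSTS_ARGS --task_type chief --task_index 0 &

# Evaluator process on the same machine; it periodically loads the
# checkpoints the chief writes, so training itself is not paused.
onmt-main train_and_eval --model_type Transformer --config config.yml \
  $HOSTS_ARGS --task_type evaluator --task_index 0
```

Note that with this scheme evaluation runs concurrently with training rather than pausing it, so the chief machine needs enough GPU memory for both processes.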


What I do is run the parameter server task and the evaluator task on the same machine, as two separate processes on different ports, because the ps task usually uses less GPU memory.
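A sketch of that colocation, under the same assumptions about OpenNMT-tf 1.x flag names and with hypothetical hosts and ports:

```shell
# Hypothetical cluster layout: the ps and the evaluator share host1.
HOSTS_ARGS="--ps_hosts host1:2222 --chief_host host0:2223 --worker_hosts host2:2224"

# Parameter server process on host1, bound to port 2222 via --ps_hosts.
onmt-main train_and_eval --model_type Transformer --config config.yml \
  $HOSTS_ARGS --task_type ps --task_index 0 &

# Evaluator process on the same host; it only reads checkpoints, so it
# does not need its own entry in the cluster spec or a dedicated port,
# and it fits alongside the memory-light ps task on one GPU.
onmt-main train_and_eval --model_type Transformer --config config.yml \
  $HOSTS_ARGS --task_type evaluator --task_index 0
```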