I followed the instructions in this issue to perform distributed training.
I know I have to use the “evaluator” task to run evaluation, but I want training to pause during evaluation, which is what happens when I am not using distributed training.
Should I allocate a new GPU for evaluation, or just run the command with “--task_type evaluator” on the “chief” machine?
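To be clear, the second option I have in mind is just a sketch like this (“train.py” stands in for the actual training script from this issue; “--task_type evaluator” is the flag mentioned above):

```python
# Sketch of option 2: a second process on the "chief" machine that only evaluates.
# "train.py" is a placeholder for the real training script in this issue.
import subprocess

# Not sure whether this process needs its own GPU, can share one, or should
# run on CPU only; that is exactly my question.
subprocess.Popen(["python", "train.py", "--task_type", "evaluator"])
```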
What I do is run the parameter server (ps) task and the evaluator task on the same machine, as two separate processes on different ports, because the ps task usually uses less GPU memory.
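Roughly like this (just a sketch, not the exact commands I run; the host names, ports, and “train.py” are placeholders, and I am assuming the standard TF_CONFIG convention, so adapt it to however the script in this issue actually passes its cluster config):

```python
# Two processes on the same physical machine ("shared-host"): one ps, one evaluator.
# Host names, ports, and "train.py" are placeholders.
import json
import os
import subprocess

cluster = {
    "chief":  ["chief-host:2222"],
    "worker": ["worker-host:2222"],
    "ps":     ["shared-host:2222"],  # the ps server binds its own port on the shared machine
}

def launch(task_type, index):
    # The evaluator is normally not listed in the cluster spec; it is identified
    # only by its task type, so the two processes on the shared machine differ
    # only in the "task" entry (and in the port the ps server listens on).
    tf_config = {"cluster": cluster, "task": {"type": task_type, "index": index}}
    env = dict(os.environ, TF_CONFIG=json.dumps(tf_config))
    return subprocess.Popen(["python", "train.py", "--task_type", task_type], env=env)

launch("ps", 0)         # light GPU memory usage
launch("evaluator", 0)  # separate process on the same machine
```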