I’ve tried setting max_grad_norm to different values. According to the documentation:
-max_grad_norm (default: 5)
Clip the gradients L2-norm to this value. Set to 0 to disable.
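
If I understand it correctly, this corresponds to the usual gradient-clipping step. A minimal PyTorch sketch of my understanding (the model, data, and optimizer here are just stand-ins, not the real setup):

```python
import torch

# Stand-in model/optimizer just to make the example runnable.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
max_grad_norm = 5.0  # mirrors -max_grad_norm 5; 0 would disable clipping

x, y = torch.randn(8, 10), torch.randn(8, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

if max_grad_norm > 0:
    # Rescales all gradients in place so their combined
    # L2 norm is at most max_grad_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
```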
However, I feel I’m just picking values without much reasoning behind them.
Would it be a good strategy to record every gradient norm during training and take something like an average? I also wonder whether there’s any way to access each gradient during the training phase, something like the sketch below.
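
Roughly what I have in mind (again a sketch with a stand-in model and data; in PyTorch, each parameter’s gradient is exposed on p.grad after backward()):

```python
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

norm_history = []  # per-step total gradient norms

for step in range(100):  # stand-in training loop
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()

    # Each parameter's gradient is available as p.grad here;
    # combine the per-parameter L2 norms into one total norm.
    total_norm = sum(
        p.grad.norm(2).item() ** 2
        for p in model.parameters()
        if p.grad is not None
    ) ** 0.5
    norm_history.append(total_norm)
    optimizer.step()

print("average gradient norm:", sum(norm_history) / len(norm_history))
```

The idea would be to compare max_grad_norm against the norms actually observed, rather than guessing. Does that make sense?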