I am seeing a pattern during training that I am unable to understand.
I have a data set of around 1.7 million examples and I am training for 120k steps.
At 40k steps the accuracy is around 83%: Step 40000/120000; acc: 83.17; ppl: 1.89; xent: 0.64;
At 50k steps it drops to around 55%: Step 50000/120000; acc: 54.80; ppl: 7.88; xent: 2.06
At 70k steps it rises again to around 85%: Step 70000/120000; acc: 84.64; ppl: 1.76; xent: 0.57;
Can anyone help me understand why this up-down-up behaviour occurs? If this is going to happen, how do I decide when to stop training, and which checkpoint is the best one?
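To make the checkpoint question concrete, here is a minimal sketch of how the logged steps could be compared. It only parses the three log lines quoted above and picks the step with the lowest perplexity. The names `parse_line` and `best_checkpoint` are hypothetical helpers, not part of any library, and the sketch assumes the numbers would come from a held-out validation set; the lines quoted here may well be training statistics, in which case they are not a reliable basis for choosing a checkpoint.

```python
import math
import re

# The three log lines quoted in this post (assumed format: step, acc, ppl, xent).
LOG_LINES = [
    "Step 40000/120000; acc: 83.17; ppl: 1.89; xent: 0.64;",
    "Step 50000/120000; acc: 54.80; ppl: 7.88; xent: 2.06",
    "Step 70000/120000; acc: 84.64; ppl: 1.76; xent: 0.57;",
]

PATTERN = re.compile(
    r"Step (\d+)/\d+; acc: ([\d.]+); ppl: ([\d.]+); xent: ([\d.]+)"
)


def parse_line(line):
    """Hypothetical helper: turn one log line into a dict, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "step": int(match.group(1)),
        "acc": float(match.group(2)),
        "ppl": float(match.group(3)),
        "xent": float(match.group(4)),
    }


def best_checkpoint(records):
    """Hypothetical helper: pick the record with the lowest perplexity."""
    return min(records, key=lambda r: r["ppl"])


records = [r for r in map(parse_line, LOG_LINES) if r is not None]
best = best_checkpoint(records)

for r in records:
    # Perplexity should be roughly exp(cross-entropy), up to rounding in the log.
    print(r["step"], r["ppl"], round(math.exp(r["xent"]), 2))
print("lowest perplexity at step", best["step"])
```

The same comparison could be run on validation lines only, keeping the checkpoint saved closest to the selected step.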
PS: I am using a Transformer model for training.