As the title says… I’ve been training a Transformer for 3 days with the default config on 5M sentences, using an RTX 2080 Ti. It has reached 170K steps so far, and over the past day BLEU has barely improved, fluctuating by only about 0.10. Should I consider this convergence? Is there any chance of overfitting? BLEU is already good at 52.64, and actual model performance is also very good.
Indeed, with 5M sentences you are unlikely to gain much more at this point, although there are always tricks to push the training further, like tuning the learning rate or accumulating gradients over more batches.
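To illustrate what accumulating gradients means, here is a minimal sketch in plain Python. It is not how any particular toolkit implements it: the toy model `y = w * x`, the helper names `grad_mse` and `accumulated_grad`, and the data are all made up for the example. It only shows that averaging gradients over equally sized micro-batches before a single update gives the same gradient as one larger batch.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, micro_batches):
    """Average the gradients of equally sized micro-batches (no update in between)."""
    total = 0.0
    for batch in micro_batches:
        total += grad_mse(w, batch)
    return total / len(micro_batches)


# Made-up data: four (x, y) pairs split into two micro-batches of two.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
micro_batches = [data[:2], data[2:]]

full_batch_grad = grad_mse(0.5, data)
accum_grad = accumulated_grad(0.5, micro_batches)
print(full_batch_grad, accum_grad)
```

In practice this is a way to train with a larger effective batch size than fits in GPU memory, at the cost of fewer parameter updates per processed batch.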
Is there any chance for overfitting?
I don’t think overfitting can happen in this setting, as the dataset is large, diverse (I suppose), and uniformly shuffled, and dropout is applied on every layer.
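If you want to double-check, one rough heuristic is to compare training and validation loss over the last few evaluations: overfitting usually shows up as training loss still going down while validation loss goes up. The sketch below is only an assumption about how one might test that, with a hypothetical helper `looks_overfit` and made-up loss values, not real numbers from this training run.

```python
def looks_overfit(train_losses, val_losses, window=3):
    """Heuristic: training loss fell while validation loss rose over the last `window` evaluations."""
    if len(train_losses) < window + 1 or len(val_losses) < window + 1:
        return False  # not enough history to compare
    train_fell = train_losses[-1] < train_losses[-window - 1]
    val_rose = val_losses[-1] > val_losses[-window - 1]
    return train_fell and val_rose


# Made-up curves, one value per evaluation.
train = [3.2, 2.6, 2.2, 2.0, 1.9, 1.85]
val_flat = [3.3, 2.8, 2.5, 2.4, 2.38, 2.37]    # validation still (slowly) improving
val_rising = [3.3, 2.8, 2.5, 2.6, 2.8, 3.0]    # validation getting worse

print(looks_overfit(train, val_flat))
print(looks_overfit(train, val_rising))
```

A flat validation curve, as in the first case, points to convergence rather than overfitting.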
Thanks @guillaumekln, that’s what I thought.