Hello! I’m studying the Transformer model and would like to clarify some questions about its training options that are mentioned in the FAQ.
- I found this formula for the learning rate in the “Attention Is All You Need” paper:

  lrate = d_model^(−0.5) · min(step_num^(−0.5), step_num · warmup_steps^(−1.5))

  Is this the formula behind the Noam decay scheme?
  If so, what is the point of setting the `learning_rate` option when this scheme is used? How is `learning_rate` used during training?
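  To make my question concrete, here is a minimal sketch of the schedule as I understand it from the paper, using the base model’s values (d_model = 512, warmup_steps = 4000); the `scale` argument is my assumption about how a `learning_rate` option might multiply the schedule:

  ```python
  # Noam schedule from "Attention Is All You Need", Sec. 5.3:
  # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
  def noam_lr(step, d_model=512, warmup_steps=4000, scale=1.0):
      """Learning rate at a given (1-based) training step.

      `scale` is an assumption: a constant multiplier that a
      `learning_rate` option might apply on top of the schedule.
      """
      return scale * d_model ** -0.5 * min(step ** -0.5,
                                           step * warmup_steps ** -1.5)

  # Rate rises linearly during warmup, peaks at warmup_steps,
  # then decays as step^-0.5.
  for step in (1, 1000, 4000, 43000):
      print(step, noam_lr(step))
  ```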
- I trained the model with these options for 43,000 steps and the translation quality is already quite good (though of course there is always room for improvement). Do you think further training is worthwhile? In other words, with these options, at which step does the learning rate become so small that there is no point in continuing training?
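To put a number on my own question (assuming the Noam schedule from the paper): past warmup the rate is proportional to step^(−0.5), so halving it requires quadrupling the step count, which is why the decay feels so slow late in training.

```python
# Past warmup the Noam rate is proportional to step^-0.5, so the
# ratio between the rates at two steps depends only on the step ratio.
def decay_ratio(step_from, step_to):
    """Factor by which the post-warmup Noam LR shrinks between two steps."""
    return (step_from / step_to) ** 0.5

print(decay_ratio(43_000, 172_000))  # -> 0.5: 4x more steps halves the LR
```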