Using learning_rate option with Noam decay scheme

Hello! I’m studying the Transformer model and would like to clarify some questions about its training options that are proposed in the FAQ.

  1. I found this formula for learning rate calculation in the “Attention Is All You Need” paper: lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5)).
    Is that the formula that underlies the Noam decay scheme?
    If so, what is the point of setting the learning_rate option when this scheme is used? How is learning_rate 2 used during training?
  2. I trained the model with these options for 43000 steps and already got pretty good translation quality (though surely there are no boundaries for perfection). Do you think there is a reason for further training? In other words, with these options, at which step does the learning rate become so small that there is no point in continuing?

you can read this:

but in a nutshell:
if you set lr=1, then yes, this is the formula you quoted.
lr=2 is now also the default in T2T, because it showed faster convergence.
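To make the interaction concrete: a minimal sketch of the Noam schedule, where the learning_rate option simply scales the whole curve from the paper (the d_model and warmup_steps values below are illustrative defaults, not taken from this thread):

```python
def noam_lr(step, d_model=512, warmup_steps=8000, learning_rate=2.0):
    """Noam decay: learning_rate is a constant multiplier on the
    lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    curve from "Attention Is All You Need"."""
    step = max(step, 1)  # guard against step 0
    return learning_rate * d_model ** -0.5 * min(
        step ** -0.5, step * warmup_steps ** -1.5
    )
```

So with learning_rate=2 every value of the schedule is exactly twice what the paper's formula (lr=1) would give; the shape of the warmup/decay curve is unchanged.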

I can tell you that even after 100k steps, even though the actual lr becomes small, the model still learns.
However, it all depends on the size of your Transformer and the dataset you use.
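For a sense of how slowly the lr shrinks between the step counts mentioned here: past warmup the Noam schedule decays as step^-0.5, so the ratio between two learning rates depends only on the ratio of the steps (a rough back-of-the-envelope calculation, assuming both steps are past warmup):

```python
# Post-warmup, lr(step) is proportional to step^-0.5, so
# lr(43k) / lr(100k) = sqrt(100000 / 43000).
ratio = (100_000 / 43_000) ** 0.5
print(f"lr at 43k steps is ~{ratio:.2f}x the lr at 100k steps")
```

In other words, going from 43k to 100k steps only shrinks the lr by about a third, which is consistent with the model still learning in that range.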