Why is the Adam learning rate in the Transformer config file different from the recommended setting?

Hi,

It might sound stupid, but could anybody tell me why the recommended setting for the Adam learning rate is 0.001, while in the Transformer config files it turns out to be 2.0 (see config-transformer-base-1GPU.yml)?

Thanks.

It’s a long story that started with Google’s T2T implementation.

The actual recommended Adam LR was 0.002, but they didn't want to use that value as the standard hparam, so they normalized it to 1.0 (a ×500 scaling).

But months later they realized that the Transformer could converge faster using an LR of 2.0.

Bear in mind that the scheduler uses the “noam” decay.
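
For reference, here is a minimal sketch of what the "noam" schedule computes. The d_model=512 and warmup_steps=8000 values are just typical base-config assumptions for illustration, not necessarily what config-transformer-base-1GPU.yml uses:

```python
def noam_lr(step, base_lr=2.0, d_model=512, warmup_steps=8000):
    """Effective learning rate under the "noam" schedule.

    base_lr is the value set in the config (e.g. 2.0); the schedule
    scales it by the Transformer warmup/decay factor from
    "Attention Is All You Need".
    """
    step = max(step, 1)
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The effective rate peaks at the end of warmup, then decays as 1/sqrt(step):
print(noam_lr(8000))    # ~0.00099 with the assumed values above
print(noam_lr(100000))  # ~0.00028
```

With these values the peak effective rate is roughly base_lr * d_model^-0.5 * warmup_steps^-0.5 ≈ 0.001, which is how an LR of 2.0 in the config lines up with the usual 0.001 Adam recommendation.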


That really helps.
Thank you so much!

Nick