Hello Jean, Terence,
The paper from Wu is not very conclusive regarding these 3 normalizations.
For instance, a wide range of alpha values gives the same results.
Do you guys have any feedback or benchmarks within the onmt scope?
I will try it myself, but I would be interested to know which configurations (languages, network size, norm parameters) you used.