In section 4.3.1 of the Sockeye paper, they use a 92.4M-parameter model to report 19.70 BLEU for OpenNMT-Lua [Sockeye 23.18 / Marian 23.54 / Nematus 23.86].
Their setup: 20 epochs (!), 1 layer of 1000, embeddings of 500.
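For reference, here is a minimal PyTorch sketch of that kind of architecture (1-layer RNN, 1000 hidden units, 500-dim embeddings). The vocabulary sizes and the absence of attention are my own assumptions, so the count will not land exactly on their 92.4M; it is only meant to show the order of magnitude.

```python
# Rough sketch of a 1-layer, 1000-hidden, 500-dim-embedding seq2seq.
# Vocab sizes (50k) and the missing attention layer are assumptions on my part,
# so the printed count will not match the paper's 92.4M exactly.
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab=50000, tgt_vocab=50000, emb=500, hidden=1000):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=1, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=1, batch_first=True)
        self.generator = nn.Linear(hidden, tgt_vocab)

model = TinySeq2Seq()
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```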
Of course I am not using exactly their setup, but the presentation is definitely misleading.
I will post more runs in this thread.
NB: we use a very aggressive in-house cleaning process, which retains only 4.1M segments out of 5.5M. This should not have a major impact, but it is worth noting that we used less data.
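The cleaning process itself is in-house and not described here, but for anyone who wants something roughly comparable, this is a minimal sketch of the usual baseline filters (length limits, length ratio, exact-duplicate removal). The thresholds are arbitrary and are not the ones we actually used.

```python
# Minimal parallel-corpus cleaning sketch: NOT our in-house process, just the
# usual baseline filters (token-length limits, length ratio, exact dedup).
def clean_corpus(pairs, max_len=80, max_ratio=1.5):
    seen = set()
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if not s or not t or len(s) > max_len or len(t) > max_len:
            continue
        if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```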
Third run.
Same setup but with embeddings of 512; 121M parameters.
Despite the same perplexity as the previous run, Newstest2017: 24.67.
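For anyone who wants to reproduce the scoring, this is how the Newstest2017 numbers in this thread can be checked with sacreBLEU; the file names and the assumption that this is the EN-DE pair are mine, not something stated above.

```python
# Scoring a detokenized Newstest2017 output with sacreBLEU.
# File names and the EN-DE language pair are assumptions on my part.
import sacrebleu

with open("newstest2017.hyp.detok") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("newstest2017.ref.de") as f:
    refs = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)
```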
@tel34 My point was just to show that the results reported by a few other papers were erroneous, not to get the highest score possible; but indeed it is already very competitive with published WMT results without back-translation.
Fourth run.
Slightly closer to the first example in the paper.
1 layer of 1024, embeddings of 512; 95.9M parameters.
For some reason I had to start the learning rate at 0.7, otherwise training diverged (rough sketch below).
Newstest2017: 23.78
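On the divergence point above: one common way to keep a large starting LR from blowing up is gradient clipping plus a decay-on-plateau schedule. All values here (0.7 start, clip at 5, halving on plateau) are illustrative only, not necessarily what this run actually used, and the model is a stand-in.

```python
# Gradient clipping + decay-on-plateau to keep a large starting LR stable.
# Everything here is illustrative: toy model, dummy data, arbitrary thresholds.
import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(10, 1)                        # stand-in for the real NMT model
optimizer = SGD(model.parameters(), lr=0.7)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=1)

x, y = torch.randn(64, 10), torch.randn(64, 1)  # dummy data
for epoch in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    scheduler.step(loss.item())                 # in practice: validation perplexity
```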