WMT16 EN-DE Benchmark

(Vincent Nguyen) #1

A recent paper (Sockeye: a toolkit for NMT) published some results for OpenNMT-Lua.

I would like to publish mine.

Corpus: CommonCrawl, Europarl, NewscommentaryV12, Rapid2016
6 epochs, 2 layers of size 512, encoder BRNN, Embeddings 256.
47.7M parameters
Newstest2017: 23.41

In section 4.3.1 of the Sockeye paper they take a 92.4M parameter model to show 19.70 for OpenNMT-Lua [Sockeye 23.18 / Marian 23.54 / Nematus 23.86]
Their setup: 20 epochs !! 1 layer 1000 / embeddings 500

Of course I am not using exactly their setup but the presentation is definitely misleading.

I will post more runs in this thread.

NB: we use a very strict in-house cleaning process that retains only 4.1M segments out of 5.5M. This should not have a major impact; I mention it only to point out that we used less data.

(Vincent Nguyen) #2

Second run.
2 layers of 1024, embeddings 256.
100.8M parameters
6 epochs (9 hours per epoch)
Newstest2017: 24.94

(Terence Lewis) #3

Interesting. I wonder what score you would get if you doubled the number of epochs.

(Vincent Nguyen) #4

Third run.
Same with embeddings 512, 121M Parameters.
Even though the perplexity is the same as in the previous run, Newstest2017: 24.67

@tel34 My point was just to verify that the results reported by a few other papers were erroneous,
not to get the highest possible score. That said, it’s already very competitive with published WMT
results without back-translation.

(Vincent Nguyen) #5

Fourth run.
Slightly closer to the first example of the paper.
1 layer 1024, embeddings 512. 95.9M parameters.
For some reason I had to start the LR at 0.7 otherwise it diverged.
Newstest2017: 23.78
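
For easier comparison, the four runs posted so far can be put side by side with the 19.70 figure cited from the Sockeye paper. This is only a small sketch: the numbers are copied from the posts above, while the variable and function names are made up for illustration. The setups differ, so the deltas are indicative rather than a controlled comparison.

```python
# (description, parameters in millions, Newstest2017 BLEU), copied from the posts above
RUNS = [
    ("2 layers x 512, emb 256", 47.7, 23.41),
    ("2 layers x 1024, emb 256", 100.8, 24.94),
    ("2 layers x 1024, emb 512", 121.0, 24.67),
    ("1 layer x 1024, emb 512", 95.9, 23.78),
]

# OpenNMT-Lua score quoted from section 4.3.1 of the Sockeye paper (92.4M parameters)
PAPER_OPENNMT_LUA_BLEU = 19.70


def gain_over_paper(bleu):
    """Difference in BLEU points relative to the figure cited from the paper."""
    return bleu - PAPER_OPENNMT_LUA_BLEU


def best_run(runs):
    """Return the run tuple with the highest BLEU."""
    return max(runs, key=lambda run: run[2])


if __name__ == "__main__":
    for name, params, bleu in RUNS:
        print(f"{name:26s} {params:6.1f}M  BLEU {bleu:5.2f}  ({gain_over_paper(bleu):+.2f})")
    print("best:", best_run(RUNS)[0])
```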

(Cong Duy Vu Hoang) #6

Hi Vincent,

Just a bit curious about your BLEU scores: are they tokenised and case-sensitive? Thanks!

(Vincent Nguyen) #7

Always cased NIST BLEU with mteval-13a.pl.
So I detokenize, and the tokenization is the one embedded in the NIST script.
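
To make the scoring convention concrete, here is a rough Python sketch of cased corpus-level BLEU on detokenized text, with a tokenizer that approximates the rules built into the NIST script. It is not a replacement for mteval-13a.pl: the function names are hypothetical, it assumes a single reference per segment, and scores may differ slightly from the real script.

```python
import math
import re
from collections import Counter


def tokenize_13a_like(text):
    """Approximate the tokenization built into the NIST script (case is preserved)."""
    text = text.replace("<skipped>", "").replace("-\n", "").replace("\n", " ")
    text = text.replace("&quot;", '"').replace("&amp;", "&")
    text = text.replace("&lt;", "<").replace("&gt;", ">")
    text = f" {text} "
    # Split off most ASCII punctuation (not apostrophe, hyphen, period or comma).
    text = re.sub(r"([\{-\~\[-\` -\&\(-\+\:-\@\/])", r" \1 ", text)
    # Split period and comma unless they sit between digits (e.g. 3.5).
    text = re.sub(r"([^0-9])([\.,])", r"\1 \2 ", text)
    text = re.sub(r"([\.,])([^0-9])", r" \1 \2", text)
    # Split a dash that directly follows a digit.
    text = re.sub(r"([0-9])(-)", r"\1 \2 ", text)
    return text.split()


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Cased corpus BLEU (0-100), one reference per segment, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok = tokenize_13a_like(hyp)
        ref_tok = tokenize_13a_like(ref)
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tok, n)
            # Counter intersection gives the clipped n-gram matches.
            matches[n - 1] += sum((hyp_counts & ngram_counts(ref_tok, n)).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

Because no lowercasing is applied, a hypothesis that differs from the reference only in capitalization scores below 100, which is what "cased" means here.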