Reproducing pre-trained Transformer model

  1. To FULLY reproduce the TRAINING of the pre-trained model, WHICH SentencePiece parameters were used? I’ve got the SentencePiece model but I’d love to know how to do it MYSELF and get the same answer.
  2. When reproducing the BLEU (26 on news14, 28 on news17 for the pre-trained model), I presume the output must be detokenized (converted back to text without underscores)?

Thanks in advance, everyone, e.g. @guillaumekln @francoishernandez

  1. The SentencePiece model was generated with this script: Look for “spm_train” to find the SentencePiece training parameters.
  2. Yes, BLEU is reported on detokenized output.

I did this more than 2 years ago.
If you’re doing this for an academic purpose it’s fine.
If you want a higher BLEU (> 32-33), you’ll need to use back-translation.


Thanks @guillaumekln but why can’t I get the same BLEU score on the pre-trained onmt model EVEN AFTER detokenizing?

I get:
BLEU = 23.16, 51.6/29.0/17.5/11.1 (BP=0.998, ratio=0.998, hyp_len=52721, ref_len=52833)
not a BLEU of 26.

I’m comparing to from wmt14.
Is that the same as news14??
Isn’t news14 in the training??

news14 is not in the training data, of course.
Please post the command line you use to compute your BLEU.

I compute BLEU with the Perl script provided with OpenNMT-py:

perl tools/multi-bleu.perl


Well, I’m not sure what your files above are, but the workflow is the following:

detokenized data => Tokenize with sentence piece => translate => tokenized output => detokenize output

Preferably use multi-bleu-detok.perl on detokenized data to compare with papers.

If your reference is detokenized, then you need to detokenize your .pred file and use the other Perl script (multi-bleu-detok.perl).

Hope this helps.
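
The last step of the workflow above (tokenized output => detokenized output) can be sketched in a few lines of Python. This is only an illustration, assuming the .pred file holds whitespace-separated SentencePiece pieces in which "▁" (U+2581) marks a preceding space; the function names and file paths are hypothetical, and in practice the SentencePiece decoder itself can do this.

```python
# Minimal sketch of SentencePiece-style detokenization (assumption: pieces
# are separated by spaces and "\u2581" marks where a space originally was).
SPACE_MARK = "\u2581"


def detokenize_pieces(line: str) -> str:
    """Join the pieces and turn the space markers back into real spaces."""
    pieces = line.split()
    return "".join(pieces).replace(SPACE_MARK, " ").strip()


def detokenize_file(pred_path: str, out_path: str) -> None:
    """Detokenize every line of a prediction file (paths are hypothetical)."""
    with open(pred_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(detokenize_pieces(line) + "\n")
```

The resulting file is what should be scored against a detokenized reference.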


That’s what I’m doing, BUT it looks like I’m using the wrong BLEU script...

That was it.
I was using the wrong perl script.
Q. What would the non-detok BLEU Perl script have been doing?
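
A possible explanation, as far as I understand the two scripts: multi-bleu.perl counts n-grams over the whitespace-separated tokens exactly as given, while multi-bleu-detok.perl first applies its own tokenization to the detokenized text. On detokenized input, the first script therefore treats "world," and "world" as different tokens. The sketch below illustrates the effect on unigram precision only; `split_punct` is a crude stand-in I made up for illustration, not the actual tokenization rules of either script.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision, the per-order quantity that BLEU combines."""
    hyp, ref = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
    total = sum(hyp.values())
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total if total else 0.0


def split_punct(text):
    """Hypothetical tokenizer: separates punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text)


hyp = "Hello world, again."
ref = "Hello world again."

# Whitespace split only: "world," does not match "world".
whitespace_p1 = clipped_precision(hyp.split(), ref.split(), 1)
# Punctuation split off first: only the extra comma is unmatched.
tokenized_p1 = clipped_precision(split_punct(hyp), split_punct(ref), 1)
print(whitespace_p1, tokenized_p1)
```

With whitespace splitting, 2 of 3 hypothesis tokens match; with punctuation separated, 4 of 5 match, which is the kind of gap that can pull a score below the published figure.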