Poor translation result with English->German and German->English pre-trained models

pytorch

(Vardaan Pahuja) #1

I am trying to use the pre-trained models from OpenNMT, but the translation quality is very poor:
http://opennmt.net/Models-py/
Here is my code:

perl tools/tokenizer.perl -a -no-escape -l en -q < sample_sentences.txt > sample_sentences.atok
python translate.py -gpu 0 -model available_models/averaged-10-epoch.pt -src sample_sentences.atok -verbose -output sample_sentences.de.atok

The output German translation for the sentence "The cat sat on the mat" is "▁The cat ?".
Input: Hello, how are you? Output: ▁Nein , ▁viel ▁mehr ! ("No, much more!")
Input: How many horses are there in the stable? Output: ▁Ganz ▁einfach . ("Quite simple.")
I even tried some training sentences from WMT, such as:
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Output: ▁Ganz ▁einfach ▁nur : ▁Das ▁Parlament ▁hat ▁sich ▁in ▁seine m ▁ganz en ▁Haus ▁versteckt . (roughly "Quite simply: Parliament has hidden itself in its whole house.")
Please tell me where I am going wrong. The model is reported to have a decent BLEU score of >25.


(Guillaume Klein) #2

You should apply the same tokenization as was used during training. In this case, apply the SentencePiece model that is included in the model archive.
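
For example, a minimal sketch using the sentencepiece Python package (the file name "sentencepiece.model" is an assumption; check the contents of the downloaded archive):

    import sentencepiece as spm

    # Load the SentencePiece model shipped with the pre-trained translation model.
    sp = spm.SentencePieceProcessor()
    sp.load("sentencepiece.model")  # assumed file name from the model archive

    # Encode the raw source sentence into subword pieces before running translate.py.
    pieces = sp.encode_as_pieces("The cat sat on the mat.")
    print(" ".join(pieces))

    # After translation, join the output pieces back into plain text.
    print(sp.decode_pieces("▁Die ▁Katze ▁saß ▁auf ▁der ▁Matte .".split()))

The SentencePiece-encoded file, not the Moses-tokenized one, is what should be passed as -src to translate.py.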


(Vardaan Pahuja) #3

For the pre-trained German -> English model, I get a lot of <unk> tokens in the translation output, even for training sentences. I looked at the training pre-processing script https://github.com/pytorch/fairseq/blob/master/data/prepare-iwslt14.sh
which applies Moses tokenization and lowercasing followed by BPE encoding, but in my case the results get worse when I apply BPE.
Questions:

  1. I am using python apply_bpe.py -c <code_file> < <input_file> > <output_file>. Should I also provide a vocabulary file as input?
  2. This model was trained with the code at SHA d4ab35a. Is there any reason it should misbehave at inference time with the latest code?
  3. Is there a decoding step required on the English output, as there was with SentencePiece? (See the sketch after this list.)

Thanks for your patience.
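
On question 3: subword-nmt's BPE marks word-internal splits with a trailing "@@", so the usual post-processing is to strip those markers from the output rather than run a SentencePiece-style decode. A minimal sketch (not an OpenNMT-specific utility; the sample string is made up for illustration):

    import re

    def undo_bpe(line: str) -> str:
        # Remove subword-nmt's "@@ " continuation markers to restore whole words.
        return re.sub(r"(@@ )|(@@ ?$)", "", line)

    print(undo_bpe("the sta@@ ble hol@@ ds ten hor@@ ses"))
    # -> "the stable holds ten horses"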

(Vardaan Pahuja) #4

UPDATE: Issue resolved. The problem was on my end: I was using a custom dataset that didn't have the attribute 'data_type' defined. It works reasonably well now.