Poor translation result with English->German and German->English pre-trained models


(Vardaan Pahuja) #1

I am trying to use the pre-trained models from OpenNMT, but the translation quality is very poor. Here is my code:

perl tools/tokenizer.perl -a -no-escape -l en -q < sample_sentences.txt > sample_sentences.atok
python translate.py -gpu 0 -model available_models/averaged-10-epoch.pt -src sample_sentences.atok -verbose -output sample_sentences.de.atok

Input: The cat sat on the mat Output: ▁The cat ?
Input: Hello, how are you? Output: ▁Nein , ▁viel ▁mehr !
Input: How many horses are there in the stable? Output: ▁Ganz ▁einfach .
I even tried some training sentences from WMT, like:
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Output: ▁Ganz ▁einfach ▁nur : ▁Das ▁Parlament ▁hat ▁sich ▁in ▁seine m ▁ganz en ▁Haus ▁versteckt .
Please enlighten me as to where I am going wrong. The model is claimed to achieve a decent BLEU score of >25.

(Guillaume Klein) #2

You should apply the same tokenization as used during the training. In that case, apply the SentencePiece model that is included in the model archive.
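For reference, SentencePiece marks word boundaries with the "▁" character visible in the broken outputs above, and decoding a translation back to plain text is just a matter of joining the pieces. A minimal pure-Python sketch of that decode step (the encode step requires the actual .model file and the sentencepiece package, which this sketch does not replace):

```python
BOUNDARY = "\u2581"  # "▁": marks the start of a word in SentencePiece output

def decode_pieces(pieces):
    """Join subword pieces back into plain text, as spm_decode would."""
    return "".join(pieces).replace(BOUNDARY, " ").strip()

print(decode_pieces(["▁Das", "▁Parlament", "▁hat", "▁sich", "▁versteck", "t"]))
# Das Parlament hat sich versteckt
```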

(Vardaan Pahuja) #3

For the pre-trained German -> English model, I get a lot of <unk> tokens in the translation output, even for training sentences. I looked at the training pre-processing script https://github.com/pytorch/fairseq/blob/master/data/prepare-iwslt14.sh
which uses Moses tokenization and lowercasing followed by BPE encoding, but in my case the results worsen when I use BPE.

  1. I am using python apply_bpe.py -c <code_file> < <input_file> > <output_file>. Should I also give some vocab file as input?
  2. This model was trained using the code at SHA d4ab35a; is there any reason it would misbehave at inference time with the latest code?
  3. Is a decoding step required when I get the English output, as was the case with SentencePiece?

Thanks for your patience.
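For context on question 3: subword-nmt's apply_bpe.py marks non-final subword units with "@@", so undoing BPE on the output is a simple string replacement (mirroring the usual sed 's/(@@ )|(@@ ?$)//g' post-processing step). A minimal sketch, with illustrative example words that are not from the actual IWSLT vocabulary:

```python
def bpe_decode(line):
    """Undo subword-nmt BPE segmentation by removing the "@@" joiners."""
    # "@@ " joins a subword to the piece that follows it; a bare trailing
    # "@@" can occur at end of line.
    return line.replace("@@ ", "").replace("@@", "")

print(bpe_decode("the sta@@ ble has hor@@ ses"))  # the stable has horses
```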

(Vardaan Pahuja) #4

UPDATE: Issue resolved. The problem was on my end: I was using a custom dataset which didn’t have the attribute ‘data_type’ defined. It works reasonably well now.

(Faburu) #5

@vardaan - Are you using the SentencePiece model? Could you please post the code you use in order to convert the text?

I am having difficulty understanding how to use it with python translate.py ...

I have the following text in sample.txt:
In every dark hour of our national life a leadership of frankness and vigor has met with that understanding and support of the people themselves which is essential to victory. I am convinced that you will again give that support to leadership in these critical days.

Then I run:
python3 translate.py -model averaged-10-epoch.pt -src sample.txt -output sample.de.txt -verbose

And the output is:

SENT 1: ('In', 'every', 'dark', 'hour', 'of', 'our', 'national', 'life', 'a', 'leadership', 'of', 'frankness', 'and', 'vigor', 'has', 'met', 'with', 'that', 'understanding', 'and', 'support', 'of', 'the', 'people', 'themselves', 'which', 'is', 'essential', 'to', 'victory.', 'I', 'am', 'convinced', 'that', 'you', 'will', 'again', 'give', 'that', 'support', 'to', 'leadership', 'in', 'these', 'critical', 'days.')
PRED 1: ▁Aber ▁wer ▁will ▁eigentlich ▁gar ▁nicht ▁mehr ? ▁Aber ▁wer ▁will ▁nicht ?
PRED SCORE: -20.2557
PRED AVG SCORE: -1.5581, PRED PPL: 4.7499

Any help is appreciated.

(Vardaan Pahuja) #6

Please see my repo https://github.com/vardaan123/ParaNet for details. You have to install SentencePiece to encode the input.
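The end-to-end workflow with the SentencePiece command-line tools might look like the sketch below. The filename sentencepiece.model is an assumption and stands for the .model file shipped in the pre-trained archive; the spm_encode/spm_decode tools come with a SentencePiece install (see https://github.com/google/sentencepiece), not necessarily with the pip package alone.

```shell
# Encode the raw input into subword pieces using the bundled model
# ("sentencepiece.model" is a placeholder for the file in the archive)
spm_encode --model=sentencepiece.model < sample.txt > sample.sp.txt

# Translate the encoded file
python3 translate.py -model averaged-10-epoch.pt -src sample.sp.txt -output sample.de.sp.txt -verbose

# Decode the pieces back to plain German text
spm_decode --model=sentencepiece.model < sample.de.sp.txt > sample.de.txt
```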

(Vincent Nguyen) #7

If you check this folder https://github.com/OpenNMT/OpenNMT-tf/tree/master/scripts/wmt
you will see a script made for onmt-tf, but you can easily adapt it to onmt-py.