Bad results from DE-EN pretrained models

Hello,

I want to use pre-trained models for DE<->EN. I downloaded the following models:

transformer-ende-wmt-pyOnmt
baseline-brnn2.s131_acc_62.71_ppl_7.74_e20.pt
iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt

I am trying to run them in a straightforward way, for example:

python translate.py -model ~/Downloads/transformer-ende-wmt-pyOnmt/averaged-10-epoch.pt -src data/test.txt -replace_unk -verbose

The result is very bad:

[2018-06-25 12:53:40,916 INFO]
SENT 1: ('Orlando', 'Bloom', 'and', 'Miranda', 'Kerr', 'still', 'love', 'each', 'other')
PRED 1: ▁Nein , ▁viel leicht ▁nicht !
PRED SCORE: -10.4686

[2018-06-25 12:53:40,916 INFO]
SENT 2: ('Actors', 'Orlando', 'Bloom', 'and', 'Model', 'Miranda', 'Kerr', 'want', 'to', 'go', 'their', 'separate', 'ways', '.')
PRED 2: ▁Seh r ▁interessant ▁und ▁interessant .
PRED SCORE: -12.3617

[2018-06-25 12:53:40,917 INFO]
SENT 3: ('However', ',', 'in', 'an', 'interview', ',', 'Bloom', 'has', 'said', 'that', 'he', 'and', 'Kerr', 'still', 'love', 'each', 'other', '.')
PRED 3: ▁Seh r ▁interessant ▁ist ▁auch ▁die ▁Tatsache , ▁dass ▁das ▁Ganze ▁noch ▁nicht ▁vollständig ▁umgesetzt ▁wurde .
PRED SCORE: -24.7087

The output makes no sense at all. The other models at least attempt something close to a translation of the text, but they are still very poor and mistranslate most of it.

What am I missing?

Thanks in advance!

Hello,

You should preprocess the files to translate using the SentencePiece model included in the model archive. See:

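By the way, the ▁ characters in your predictions are SentencePiece word-boundary markers: decoding a prediction back to plain text just concatenates the pieces and turns each ▁ into a space. A minimal pure-Python sketch (for illustration only, independent of the SentencePiece library):

```python
def sp_detokenize(pieces):
    """Join SentencePiece subword pieces back into plain text.

    In SentencePiece output, "▁" (U+2581) marks the start of a
    new word; any other piece attaches to the one before it.
    """
    return "".join(pieces).replace("\u2581", " ").strip()

# Pieces as they appear in the PRED lines above:
print(sp_detokenize("▁Nein , ▁viel leicht ▁nicht !".split()))
# → Nein, vielleicht nicht!
```

Encoding your source text the way the model expects requires the actual sentencepiece.model file shipped in the model archive, e.g. via the spm_encode command-line tool.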
Thank you! The tokenization helped for the transformer model, but the German-English model is still very bad:

python translate.py -model ~/Downloads/iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt -src data/test_de.txt -replace_unk -verbose

[2018-06-27 10:39:53,631 INFO]
SENT 1: ('Orlando', 'Bloom', 'ist', 'der', 'Sohn', 'der', 'Kinderbuchautorin', 'Sonia', 'Copeland-Bloom.[1]', 'Sein', 'rechtlicher', 'Vater', 'ist', 'der', 'südafrikanische', 'Journalist', 'und', 'Bürgerrechtler', 'Harry', 'Bloom.', 'Erst', 'im', 'Erwachsenenalter', 'erfuhr', 'er,', 'dass', 'sein', 'biologischer', 'Vater', 'sein', 'Taufpate', 'Colin', 'Stone,', 'ein', 'Schuldirektor', 'und', 'enger', 'Freund', 'der', 'Familie,', 'ist.')
PRED 1: Orlando Bloom is the Sohn of the Kinderbuchautorin Sonia Sein rechtlicher Vater , who is the südafrikanische Journalist and Bürgerrechtler Harry Erst in Erwachsenenalter , that his biological Vater sein his Taufpate Colin Stone, ein a Schuldirektor and enger der of the Familie, galaxy .

I tried lowercasing the German input; it slightly improves the results, but they are still unsatisfactory.
Any preprocessing steps I am missing here?

The models do indeed use different data preprocessing. Look at the “Corpus prep” column:

Thanks! The SentencePiece model worked fine.
FYI, the link to the IWSLT14 preparation script in the table does not work.

I ran the preparation script on the IWSLT14 data, and then also ran the Moses scripts on my input data:

perl tokenizer.perl -l de -threads 8 < test_de.txt > /tmp/tmp
perl lowercase.perl < /tmp/tmp > original_text_tokenized.txt
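As I understand it, these two scripts mainly lowercase the text and split punctuation off the words. For illustration, here is a crude pure-Python approximation of that behavior (not a replacement for the real Moses scripts, which also handle abbreviations, numbers and URLs):

```python
import re

def rough_moses_prep(line):
    """Crude stand-in for tokenizer.perl + lowercase.perl:
    lowercase the line, then split punctuation from the words.
    The real Moses scripts cover many more cases, so use those
    for actual data preparation."""
    line = line.lower()
    # surround punctuation with spaces so it becomes separate tokens
    line = re.sub(r"([.,!?;:()\"])", r" \1 ", line)
    return " ".join(line.split())

print(rough_moses_prep("Ich bin ledig."))
# → ich bin ledig .
```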

Unfortunately, I got unsatisfactory results on my data:

[2018-06-27 12:13:12,434 INFO] 
SENT 2: ('sehr', 'geehrte', 'damen', 'und', 'herren', '!', 'bei', 'einem', 'lieferantenbesuch', 'in', 'china', 'habe', 'ich', 'von', 'einem', 'geschäftspartner', 'im', 'rahmen', 'einer', 'abendveranstaltung', 'drei', 'flaschen', 'rotwein', 'im', 'wert', 'von', 'ca.', 'jew', '.', '30.-', '€', 'geschenkt', 'bekommen', '.', 'diese', 'habe', 'ich', 'aus', 'gründen', 'der', 'wertschätzung', 'und', 'höflichkeit', 'angenommen', 'meine', 'frage', 'ist', 'nun', ':', 'wie', 'muss', 'ich', 'mich', 'nun', 'weiter', 'verhalten', '?', 'kann', 'ich', 'die', 'einzelnen', 'flaschen', 'meinen', 'mitarbeitern', 'als', 'weihnachtsgeschenk', 'überreichen', '?', 'vielen', 'dank', 'für', 'ihre', 'auskunft', '.', 'mit', 'freundlichen', 'grüßen')
PRED 2: ladies and gentlemen , ladies and gentlemen , in china , i have a business business in china in the rahmen of three bottles of red wine in the wert of about jew . and so , for reasons that i &apos;ve been given the appreciation and höflichkeit , i &apos;ve got my question now : how am i going to have to behave ? can i have the single bottles of my staff as a weihnachtsgeschenk person ? thank you very much .
PRED SCORE: -75.9947

To check whether it is an out-of-domain problem, I tried applying the model to the preprocessed IWSLT data created by the script you linked. The output is a bit better, but still not good:

[2018-06-27 12:26:26,685 INFO]
SENT 7: ('die', 'erste', 'dieser', 'fallen', 'ist', 'ein', 'widerstreben', ',', 'komplexität', 'zuzugeben', '.')
PRED 7: the first of these fall is a widerstreben , komplexität .
PRED SCORE: -7.6431

[2018-06-27 12:26:26,685 INFO]
SENT 9: ('ich', 'denke', ',', 'es', 'gibt', 'eine', 'bestimmte', 'bedeutung', ',', 'auf', 'die', 'wir', 'es', 'beschränken', 'könnten', ',', 'aber', 'im', 'großen', 'und', 'ganzen', 'ist', 'das', 'etwas', ',', 'das', 'wir', 'aufgeben', 'werden', 'müssen', ',', 'und', 'wir', 'werden', 'die', 'komplizierte', 'sichtweise', 'annehmen', 'müssen', 'darauf', ',', 'was', 'wohlbefinden', 'ist', '.')
PRED 9: i think there 's a certain meaning that we could keep it on , but in the big , and all of this is something that we need to give up , and we will have the complicated view of what is well-being .
PRED SCORE: -23.9721

Could it be that this is just how the model performs? Is this the best it gets? The quality seems poor for IWSLT14.

It looks like a fairly small model. Can you reproduce the BLEU score reported in the table on the IWSLT14 test set? If so, that is the performance you should expect from this model.

If you are looking for better results, you should probably train your own model using a Transformer configuration (as was done for the English-German direction, which achieves competitive results).
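As a starting point (not the exact recipe used for the pretrained model), the base Transformer settings documented in the OpenNMT-py FAQ look roughly like this; the -data and -save_model paths are placeholders for your own preprocessed dataset:

```shell
# Sketch of a base Transformer training run, following the
# hyperparameters documented in the OpenNMT-py FAQ; paths are
# placeholders and assume preprocess.py has already been run.
python train.py -data data/iwslt14 -save_model iwslt14-transformer \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -layers 6 -rnn_size 512 -word_vec_size 512 -heads 8 -transformer_ff 2048 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 \
    -learning_rate 2 -label_smoothing 0.1 -dropout 0.1 -train_steps 100000 \
    -world_size 1 -gpu_ranks 0
```

Given the small size of IWSLT14, a smaller configuration (fewer layers, smaller -rnn_size) and fewer training steps may well be more appropriate.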

Yes! I think this is it. I will try to train my own transformer model. Thanks for your quick reply!

Thanks, I just fixed that.

Hi, I wasn't sure where to mention this, but I just downloaded the archive with the preprocessed training corpus, had a look at the files, and noticed something strange.

I presume train.en is the English source and train.de is the German translation?

After the sixth line in each of these files, the sentences don’t have any relationship to each other. Is this expected? I had assumed line X in one file would map to line X in the other.

For example line 7 of train.en:

▁Trans l ator ▁Internet ▁is ▁a ▁Tool bar ▁for ▁MS ▁Internet ▁Explorer .

line 7 of train.de:

▁A CD See ▁9 ▁Photo ▁Manager ▁Organ ize ▁your ▁photos . ▁Share ▁your ▁world .

I might be mistaken about how the data is used, but wanted to bring it up in case it’s causing people problems trying to reproduce the model.

Hello,

There may be some badly aligned instances, but they should make up only a small share of the total dataset.

@guillaumekln: I'm still unsure how to preprocess text for the DE-EN pretrained PyTorch MT model iwslt-brnn2.s131_acc_62.71_ppl_7.74_e20.pt given here http://opennmt.net/Models-py/ under the section “German->English”.
Are these the only two pre-processing instructions:

Using these instructions gave me:

SENT 1: ('ich', 'bin', 'ledig')
PRED 1: i 'm a ledig .
PRED SCORE: -2.5008
PRED AVG SCORE: -0.5002, PRED PPL: 1.6490

That's quite a poor translation of a simple sentence for a model with BLEU ~30. Am I missing something here? Have you been able to reproduce the BLEU score with the pretrained DE-EN model?

That looks about right. I did reproduce the BLEU score on the IWSLT test set. However, I don't think we can expect good generic results from this model, as both the model and the training dataset are fairly small.