Poor translation result with English->German and German->English pre-trained models

pytorch

(Vardaan Pahuja) #1

I am trying to use the pre-trained models from OpenNMT, but the translation quality is very poor:
http://opennmt.net/Models-py/
Here is my code:

perl tools/tokenizer.perl -a -no-escape -l en -q < sample_sentences.txt > sample_sentences.atok
python translate.py -gpu 0 -model available_models/averaged-10-epoch.pt -src sample_sentences.atok -verbose -output sample_sentences.de.atok

The output German translation for the sentence "The cat sat on the mat" is "▁The cat ?".
Input: Hello, how are you? Output: ▁Nein , ▁viel ▁mehr ! ("No, much more!")
Input: How many horses are there in the stable? Output: ▁Ganz ▁einfach . ("Quite simple.")
I even tried some training sentences from WMT, such as:
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Output: ▁Ganz ▁einfach ▁nur : ▁Das ▁Parlament ▁hat ▁sich ▁in ▁seine m ▁ganz en ▁Haus ▁versteckt . (roughly "Quite simply: Parliament has hidden itself in its whole house.")
Please tell me where I am going wrong. The model is reported to have a decent BLEU score of >25.


(Guillaume Klein) #2

You should apply the same tokenization as was used during training. In this case, apply the SentencePiece model that is included in the model archive.
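
For example, a minimal sketch using the sentencepiece Python package (the file name "sentencepiece.model" is an assumption; check the contents of the downloaded archive):

    import sentencepiece as spm

    # Load the SentencePiece model shipped with the pre-trained translation model.
    sp = spm.SentencePieceProcessor()
    sp.load("sentencepiece.model")  # assumed file name from the model archive

    # Encode the raw source sentence into subword pieces before running translate.py.
    pieces = sp.encode_as_pieces("The cat sat on the mat.")
    print(" ".join(pieces))

    # After translation, join the output pieces back into plain text.
    print(sp.decode_pieces("▁Die ▁Katze ▁saß ▁auf ▁der ▁Matte .".split()))

The SentencePiece-encoded file, not the Moses-tokenized one, is what should be passed as -src to translate.py.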


(Vardaan Pahuja) #3

For the pre-trained German -> English model, I get a lot of <unk> tokens in the translation output, even for training sentences. I looked at the training pre-processing script https://github.com/pytorch/fairseq/blob/master/data/prepare-iwslt14.sh
which applies Moses tokenization and lowercasing followed by BPE encoding, but in my case the results get worse when I apply BPE.
Questions:

  1. I am using python apply_bpe.py -c <code_file> < <input_file> > <output_file>. Should I also provide a vocabulary file as input?
  2. This model was trained with the code at SHA d4ab35a. Is there any reason it should misbehave at inference time with the latest code?
  3. Is there a decoding step required on the English output, as there was with SentencePiece? (See the sketch after this list.)

Thanks for your patience.
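
On question 3: subword-nmt's BPE marks word-internal splits with a trailing "@@", so the usual post-processing is to strip those markers from the output rather than run a SentencePiece-style decode. A minimal sketch (not an OpenNMT-specific utility; the sample string is made up for illustration):

    import re

    def undo_bpe(line: str) -> str:
        # Remove subword-nmt's "@@ " continuation markers to restore whole words.
        return re.sub(r"(@@ )|(@@ ?$)", "", line)

    print(undo_bpe("the sta@@ ble hol@@ ds ten hor@@ ses"))
    # -> "the stable holds ten horses"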

(Vardaan Pahuja) #4

UPDATE: Issue resolved. The problem was on my end: I was using a custom dataset that didn't have the attribute 'data_type' defined. It works reasonably well now.