I am trying to reproduce the results of the pre-trained English -> German (WMT) model listed in the OpenNMT-py (PyTorch) documentation. When I use the pre-trained model to translate test.en (a file included in the WMT data archive linked under the 'Corpus Prep' column of the English -> German (WMT) table, i.e. https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp.tar.gz),
I see unsatisfactory results against test.de when I run the following command:
python translate.py -model transformer-ende-wmt-pyOnmt/averaged-10-epoch.pt -src data/test.en -tgt data/test.de -verbose
SENT 1: ('▁28', '-', 'Y', 'ear', '-', 'O', 'ld', '▁Chef', '▁Found', '▁Dead', '▁at', '▁San', '▁Francisco', '▁Mal', 'l')
PRED 1: ▁28 - Jahr - O ld ▁Chef ▁Found ▁Dead
PRED SCORE: -7.7273
GOLD 1: ▁28 - jährige r ▁Koch ▁in ▁San ▁Francisco ▁Mal l ▁to t ▁auf gefunden
GOLD SCORE: -25.5761
SENT 2: ('▁A', '▁28', '-', 'year', '-', 'old', '▁chef', '▁who', '▁had', '▁recently', '▁moved', '▁to', '▁San', '▁Francisco', '▁was', '▁found', '▁dead', '▁in', '▁the', '▁sta', 'ir', 'well', '▁of', '▁a', '▁local', '▁mall', '▁this', '▁week', '.')
PRED 2: ▁Ein ▁28 - jährige r ▁Küchen chef , ▁der ▁vor ▁kurze m ▁nach ▁San ▁Francisco ▁ zog , ▁wurde ▁diese ▁Woche ▁im ▁Trepp en haus ▁eines ▁lokale n ▁Einkaufszentrum s ▁to t ▁auf gefunden .
PRED SCORE: -13.8766
GOLD 2: ▁Ein ▁28 - jährige r ▁Koch , ▁der ▁vor ▁kurze m ▁nach ▁San ▁Francisco ▁gezogen ▁ist , ▁wurde ▁im ▁Trepp en haus ▁eines ▁ örtlich en ▁Einkauf zentrum s ▁to t ▁auf gefunden .
GOLD SCORE: -34.4022
I have several questions regarding the pretrained model:
1. I want to know exactly which data the model was trained on and which data it was tested on. I am assuming that the train.de file from the link under the 'Corpus Prep' column of the English -> German (WMT) pretrained model table was used for training. Am I right about this? And was the model tested on the test.de file? If so, what is the expected BLEU score on test.de?
2. Do the test.de and train.de files contain preprocessed data, or do I need to perform preprocessing steps before feeding them to translate.py? If so, what are the required preprocessing steps?
3. I also want to run the model on the WMT14 and WMT17 test sets to check whether I achieve the reported BLEU scores of 26.89 and 28.09, respectively. When testing on these datasets, do I need to perform any preprocessing steps? (The way I am planning to compute BLEU is sketched after this list.)
4. I have seen in another post that using the SentencePiece model to tokenize the test data improves accuracy. However, I am unable to find any documentation on how to use the SentencePiece.model file to perform tokenization. Can you please elaborate on how to use this file? (I am using the PyTorch implementation of OpenNMT; my best guess at the workflow is the first sketch after this list.)
5. In the English -> German (WMT) pretrained model table, the link under the 'Translation Parameters' column points to documentation titled 'How do I use the Transformer model?', which lists the training parameters but not the translation parameters. Is that the correct link?
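To make questions 2 and 4 concrete, here is a minimal sketch of what I assume the SentencePiece step looks like, using the Python sentencepiece package. The model file name and all of the data file paths (wmt_ende_sp/sentencepiece.model, data/test.raw.en, data/test.sp.en, pred.sp.de, pred.detok.de) are only my guesses, not names confirmed by the documentation:

import sentencepiece as spm

# Load the SentencePiece model shipped with the corpus archive
# (file name is an assumption -- adjust to whatever the archive actually contains).
sp = spm.SentencePieceProcessor()
sp.load("wmt_ende_sp/sentencepiece.model")

# Tokenize raw English source text into subword pieces before running translate.py
with open("data/test.raw.en", encoding="utf-8") as fin, \
     open("data/test.sp.en", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode_as_pieces(line.strip())
        fout.write(" ".join(pieces) + "\n")

# Detokenize the model's subword output back into plain German text
with open("pred.sp.de", encoding="utf-8") as fin, \
     open("pred.detok.de", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(sp.decode_pieces(line.strip().split()) + "\n")

Is that roughly the intended workflow, i.e. encode the raw source with the shipped model, translate, then decode the predictions back to plain text before scoring?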
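And this is how I am planning to compute BLEU on the detokenized output for question 3, using sacrebleu (again just my assumption of the scoring setup; the file names are the hypothetical ones from the sketch above):

import sacrebleu

# Detokenized system output and the raw (untokenized) German reference
with open("pred.detok.de", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("data/test.raw.de", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)

Would a score computed this way be comparable to the reported 26.89 / 28.09, or is a different scoring setup expected?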
Thanks in advance!