For the pre-trained German -> English model, I get a lot of `<unk>` tokens in the translation output, even for the training sentences. I looked at the training pre-processing script https://github.com/pytorch/fairseq/blob/master/data/prepare-iwslt14.sh, which applies Moses tokenization and lowercasing followed by BPE encoding, but in my case the results get worse when I apply BPE. Also, even common German words (like Fußball) come through unchanged in my English output.
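For reference, here is roughly the inference-time preprocessing I am running, pieced together from prepare-iwslt14.sh; the script paths, code file, and file names are placeholders from my local setup, so take this as a sketch of my pipeline rather than the exact commands:

```bash
# Tokenize and lowercase raw German input with the Moses scripts,
# mirroring prepare-iwslt14.sh (paths and file names are placeholders).
cat test.de \
  | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -threads 8 -l de \
  | perl mosesdecoder/scripts/tokenizer/lowercase.perl \
  > test.tok.de

# Apply the BPE codes that were learned on the training data.
python subword-nmt/apply_bpe.py -c code_file < test.tok.de > test.bpe.de
```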
Questions:
- I am using `python apply_bpe.py -c <code_file> < <input_file> > <output_file>`. Should I also pass some vocabulary file as input? (See the first sketch after this list.)
- This model was trained with the code at SHA d4ab35a. Is there any reason it should misbehave at inference time with the latest code?
- Is there a decoding step required once I get the English output, as there was with sentencepiece? I suppose `sed -r 's/(@@ )|(@@ ?$)//g'` should be used for decoding? (See the second sketch below.)
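For the first question, this is the vocabulary-filtered variant of `apply_bpe.py` I was asking about. I know subword-nmt supports `--vocabulary`/`--vocabulary-threshold` and ships a `get_vocab.py`, but the file names and threshold here are placeholders, and I don't know whether the pre-trained model was prepared this way:

```bash
# Build a subword vocabulary from the BPE-encoded training data
# (exact get_vocab.py flags may differ between subword-nmt versions).
python subword-nmt/get_vocab.py < train.bpe.de > vocab.de

# Re-apply BPE with the vocabulary filter: subword units below the
# frequency threshold get split further instead of being emitted.
python subword-nmt/apply_bpe.py -c code_file \
    --vocabulary vocab.de --vocabulary-threshold 50 \
    < test.tok.de > test.bpe.de
```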
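And for the decoding question, this is the full command I had in mind; it just strips the `@@ ` continuation markers that BPE inserts (file names are placeholders), and the Moses tokenization would still need to be undone separately, e.g. with detokenizer.perl:

```bash
# Remove the "@@ " BPE continuation markers to recover tokenized text;
# this does not undo the Moses tokenization itself.
sed -r 's/(@@ )|(@@ ?$)//g' < model_output.en > model_output.tok.en
```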
Thanks for your patience