For the pre-trained German -> English model, I get a lot of `<unk>` tokens in the translation output, even for the training sentences. I looked at the training pre-processing script https://github.com/pytorch/fairseq/blob/master/data/prepare-iwslt14.sh, which applies Moses tokenization and lowercasing followed by BPE encoding, but in my case the results get worse when I apply BPE. Also, even common German words (like Fußball) come through unchanged in my English output.
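For reference, here is roughly the inference-time preprocessing I am running, pieced together from prepare-iwslt14.sh; the script paths, code file, and file names are placeholders from my local setup, so take this as a sketch of my pipeline rather than the exact commands:

```bash
# Tokenize and lowercase raw German input with the Moses scripts,
# mirroring prepare-iwslt14.sh (paths and file names are placeholders).
cat test.de \
  | perl mosesdecoder/scripts/tokenizer/tokenizer.perl -threads 8 -l de \
  | perl mosesdecoder/scripts/tokenizer/lowercase.perl \
  > test.tok.de

# Apply the BPE codes that were learned on the training data.
python subword-nmt/apply_bpe.py -c code_file < test.tok.de > test.bpe.de
```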
Questions:
- I am using `python apply_bpe.py -c <code_file> < <input_file> > <output_file>`. Should I also pass some vocabulary file as input? (See the first sketch after this list.)
- This model was trained with the code at SHA d4ab35a. Is there any reason it should misbehave at inference time with the latest code?
- Is there a decoding step required once I get the English output, as there was with sentencepiece? I suppose `sed -r 's/(@@ )|(@@ ?$)//g'` should be used for decoding? (See the second sketch below.)
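For the first question, this is the vocabulary-filtered variant of `apply_bpe.py` I was asking about. I know subword-nmt supports `--vocabulary`/`--vocabulary-threshold` and ships a `get_vocab.py`, but the file names and threshold here are placeholders, and I don't know whether the pre-trained model was prepared this way:

```bash
# Build a subword vocabulary from the BPE-encoded training data
# (exact get_vocab.py flags may differ between subword-nmt versions).
python subword-nmt/get_vocab.py < train.bpe.de > vocab.de

# Re-apply BPE with the vocabulary filter: subword units below the
# frequency threshold get split further instead of being emitted.
python subword-nmt/apply_bpe.py -c code_file \
    --vocabulary vocab.de --vocabulary-threshold 50 \
    < test.tok.de > test.bpe.de
```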
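And for the decoding question, this is the full command I had in mind; it just strips the `@@ ` continuation markers that BPE inserts (file names are placeholders), and the Moses tokenization would still need to be undone separately, e.g. with detokenizer.perl:

```bash
# Remove the "@@ " BPE continuation markers to recover tokenized text;
# this does not undo the Moses tokenization itself.
sed -r 's/(@@ )|(@@ ?$)//g' < model_output.en > model_output.tok.en
```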
Thanks for your patience