German -> English pre-trained model issues


(Vardaan Pahuja) #1

With the pre-trained German -> English model, I get a lot of <unk> tokens in the translation output, even for training sentences. I looked at the training pre-processing script,
which applies Moses tokenization and lowercasing followed by BPE encoding, but in my case the results get worse when I apply BPE. Common German words (like Fußball) are also copied unchanged into my English output.

  1. I am using python -c <code_file> < <input_file> > output_file. Should I also pass a vocabulary file as input?
  2. This model was trained using the code at SHA d4ab35a. Is there any reason it would misbehave at inference time with the latest code?
  3. Is a decoding step required once I get the English output, as was the case with sentencepiece? I suppose sed -r 's/(@@ )|(@@ ?$)//g' should be used for decoding?
    Thanks for your patience.
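
For reference, here is a minimal Python sketch of the decoding step from question 3. It assumes the model output marks BPE word pieces with the "@@ " continuation marker that the sed expression targets. The name remove_bpe is only illustrative and is not part of the toolkit.

```python
import re

# Same pattern as the sed expression: "@@ " anywhere in the line,
# or a trailing "@@" (optionally followed by one space) at the end.
BPE_MARKER = re.compile(r"(@@ )|(@@ ?$)")


def remove_bpe(line: str) -> str:
    """Join BPE word pieces back into whole words."""
    return BPE_MARKER.sub("", line)
```

Applying remove_bpe to each line of the translation output should give the same result as piping the file through the sed command.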

(Vardaan Pahuja) #2

I’d be grateful for any help with this issue: how should I preprocess the data for input to the German -> English translation model (PyTorch)?