Transformer model is generating empty lines when using a SentencePiece model

I am using a SentencePiece model as the tokenizer. When I encode the source text file with the SPM encoder, everything is fine.

Command:
spm_encode --model=eng.model --output_format=piece input.txt --output input_tok.txt --extra_options=bos:eos

When running the translation command with the model, a few lines are translated as empty, and it is not that those source sentences are short one-word phrases; they are complete sentences.

Command:
onmt_translate -model Eng_Fr_Model.pt -src input_tok.txt -output fr_output_tok.txt -replace_unk -verbose

I am working on English-French machine translation on the Europarl dataset.
Any suggestions on how to resolve this?

Can you show an example input (tokenized) that gives you an empty translation?

@BramVanroy Here is the input file after tokenization.

And this is the output after translation.

@guillaumekln please help.

I suggest not using these options when training with OpenNMT-py. The framework already injects these tokens into the data.
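
For example, the encoding command from the first post would simply drop the extra option (same model and file names assumed):

spm_encode --model=eng.model --output_format=piece input.txt --output input_tok.txt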

Hi,
I’m experiencing the same issue with a Transformer model. In my case I’m using BPE and I’m not using BOS or EOS tokens. Some lines are translated as empty. Any guesses?

Hi everyone,
Does anyone have an explanation and a solution to this issue in the meantime?

Thanks,
Arda

Hi,

When no length penalty or normalization is applied, the empty output can have a better score than any other prediction.
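
As a rough sketch of why this happens: without normalization, the beam score of a hypothesis $Y$ is the sum of token log-probabilities, which only decreases as $Y$ gets longer, so a hypothesis that emits the end-of-sentence token immediately can come out on top. The "wu" penalty (from Wu et al., 2016) divides that sum by a length-dependent term controlled by alpha:

$$
\mathrm{score}(Y \mid X) = \frac{\sum_{t=1}^{|Y|} \log P(y_t \mid y_{<t}, X)}{lp(Y)},
\qquad
lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}}
$$

With alpha = 0 there is effectively no normalization; larger values increasingly favor longer outputs.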

There are multiple solutions in OpenNMT-py (see the example command after this list):

  • Set a minimum decoding length: -min_length 1
  • Set a length penalty: -length_penalty wu -alpha 0.1
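
For instance, adding the first option to the translation command from the original post (same model and file names) looks like this:

onmt_translate -model Eng_Fr_Model.pt -src input_tok.txt -output fr_output_tok.txt -replace_unk -verbose -min_length 1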

Thanks for your reply, Guillaume!

Just as extra information for anyone who might check this thread in the future, I will also share my experience with both configurations:

  • The issue was solved with -min_length 1 and the BLEU scores were as expected.
  • I still had some blank lines (no predictions) with the length penalty (wu, alpha=0.1).

Regarding the length penalty, you probably used a bad value for -alpha. You can also try other values like 0.6 (the default in Tensor2Tensor).
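
For example, the same translation command with that suggested penalty value would be:

onmt_translate -model Eng_Fr_Model.pt -src input_tok.txt -output fr_output_tok.txt -replace_unk -verbose -length_penalty wu -alpha 0.6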

But -min_length is probably the easiest solution here.
