Extra token produced


I converted a ModernMT model (Fairseq) to CT2. Everything mostly works, regardless of the quantization used.

I do not know whether this matters, but in my application the source sentences begin with extra information that the model should use to produce a tuned translation of the rest of the sentence. This works well with the original ModernMT model.

Problem: especially on short sentences, CT2 often produces an extra token at the beginning of the output (usually a punctuation mark).

I tried adjusting the beam_size, length_penalty, and coverage_penalty parameters, without success.

I wonder whether this is because ModernMT does not seem to use a BOS token, only an EOS token.
So I tried this config.json file:

  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "",
  "decoder_start_token": "<EOS>_",
  "eos_token": "<EOS>_",
  "layer_norm_epsilon": null,
  "unk_token": "<UNK>_"

Any idea how to prevent these extra tokens from being produced?


Models trained with Fairseq usually require the EOS token at the end of the source input.

Do you add this token to the input before running the model? Alternatively, you could enable add_source_eos in the configuration.
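
If you go the configuration route, here is a minimal sketch of the change. It assumes the converted model's config.json is a plain JSON object like the one quoted above; the helper name enable_source_eos is only illustrative and is not part of CT2. Only one side should supply the EOS token, so this only makes sense if your preprocessing does not already append it.

```python
import json
from pathlib import Path


def enable_source_eos(config_path):
    """Set "add_source_eos" to true in a converted model's config.json.

    Hypothetical helper: it only edits the JSON file, it does not call CT2.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Ask the runtime to append the EOS token to every source sentence.
    config["add_source_eos"] = True
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

The model has to be reloaded after the file is changed for the new setting to be picked up.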

Yes, I add it.
More precisely: I’m using the ModernMT preprocessing that produces the EOS token.


Do you have a lot of short sentences in your training/validation sets?

I experienced that in the past: the model learned to produce a certain sentence length because I was providing a lot of sentences of roughly the same length.

No problem with the training set: it was working properly with the original ModernMT model. The problem occurs with CT2 using the converted model.

Can you show how you are calling the MMT preprocessing and then how you call translate_batch in CT2?

First, I need textencoder.py:

MMT encoding code (cloned from MMT):

from textencoder import SubwordDictionary  # assumes MMT's textencoder.py is on the import path
sub_dict = SubwordDictionary.load("./engines/"+ENGINE+"/model.vcb")
indexes = sub_dict.encode_line(input_text, line_tokenizer=sub_dict.tokenize, add_if_not_exist=False)

Remark: encode_line is inherited from fairseq.data.Dictionary.

At this step, MMT directly builds a Tensor, but CT2 needs tokens. As a first attempt, I wanted to keep MMT's exact processing, so I need to convert this Tensor back to a token list:

            indexes = indexes.long()
            tokens = [sub_dict.symbols[idx] for idx in indexes]

At this step, the token list properly ends with the "<EOS>_" token.
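
For reference, a self-contained sketch of this index-to-token conversion and of the EOS check. The vocabulary and the ids below are toy values standing in for sub_dict.symbols and for the output of encode_line; they are not taken from the real model.vcb. The helper name ids_to_tokens is only illustrative.

```python
# Toy vocabulary standing in for sub_dict.symbols (hypothetical entries).
symbols = ["<PAD>_", "<EOS>_", "<UNK>_", "Hello_", "world_", "._"]


def ids_to_tokens(ids, symbols):
    """Map vocabulary indices back to their token strings."""
    # int() also accepts one-element integer tensors, so the same loop
    # works on a plain list or on the values of a LongTensor.
    return [symbols[int(i)] for i in ids]


# Toy ids standing in for the encoded sentence, ending with the EOS index.
ids = [3, 4, 5, 1]
tokens = ids_to_tokens(ids, symbols)

# The list sent to CT2 should end with exactly one EOS token.
ends_with_eos = tokens[-1] == "<EOS>_"
eos_count = tokens.count("<EOS>_")
print(tokens, ends_with_eos, eos_count)
```

With "add_source_eos" set to false, a check like this confirms that the EOS token comes from the preprocessing alone and appears only once.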

Then, I send it to CT2:

            results = translator.translate_batch([tokens]
                                                 # ,length_penalty=1
                                                 # ,coverage_penalty=0
                                                 # ,repetition_penalty=1