Extra token produced

Etienne38 · July 26, 2023, 9:13am

Hi,

I converted a ModernMT model (Fairseq) to CT2. Everything works quite well, whatever quantization is used.

For information: I do not know if this is important, but in my application the source sentences begin with extra information that the model should use to produce a tuned translation of the rest of the sentence. This works well with the original ModernMT model.

Problem: especially on short sentences, CT2 often produces an extra token at the beginning of the output (usually a punctuation mark).

I tried adjusting the beam_size, length_penalty, and coverage_penalty parameters, without success.

I wonder if this may be because ModernMT does not seem to use a BOS token, only an EOS token.
So I tried this config.json file:

{
  "add_source_bos": false,
  "add_source_eos": false,
  "bos_token": "",
  "decoder_start_token": "<EOS>_",
  "eos_token": "<EOS>_",
  "layer_norm_epsilon": null,
  "unk_token": "<UNK>_"
}

Any idea how to prevent these extra tokens from being produced?

guillaumekln · July 26, 2023, 9:27am

Hi,

Models trained with Fairseq usually require the EOS token at the end of the source input.

Do you add this token in the input before running the model? Alternatively you could enable add_source_eos in the configuration.
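
The configuration option mentioned above could be toggled with a small script such as the one below. This is only a sketch: it assumes that config.json sits in the converted model directory, and the helper name and the "ct2_model" path in the usage comment are hypothetical. If the preprocessing already appends the EOS token, enabling this option as well would likely put a second EOS on the source, so only one of the two mechanisms should be active.

```python
import json
from pathlib import Path


def set_add_source_eos(model_dir, enabled):
    """Rewrite the add_source_eos flag in <model_dir>/config.json.

    Hypothetical helper: model_dir is assumed to be the directory
    written by the CTranslate2 converter.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["add_source_eos"] = bool(enabled)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Usage (hypothetical path):
# set_add_source_eos("ct2_model", True)
```

The translator has to be reloaded after the file is changed, since the configuration is read when the model is loaded.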

Etienne38 · July 26, 2023, 9:28am

Yes, I add it.
More precisely: I’m using the ModernMT preprocessing that produces the EOS token.

SamuelLacombe · July 26, 2023, 10:57am

Hello,

Do you have a lot of short sentences in your training/validation sets?

I experienced this in the past: the model learned to produce a certain sentence length because I was providing a lot of sentences of more or less the same length.

Etienne38 · July 26, 2023, 11:08am

No problem with the training set: it was working properly with the original ModernMT model. The problem occurs with CT2 using the converted model.

guillaumekln · July 26, 2023, 12:56pm

Can you show how you are calling the MMT preprocessing and then how you call translate_batch in CT2?

Etienne38 · July 26, 2023, 2:25pm

First, I need textencoder.py:

github.com/modernmt/modernmt/blob/master/src/decoder-neural/src/main/python/mmt/textencoder.py

import collections
import logging
import multiprocessing
import os
import re
import tempfile
from itertools import chain

import cachetools
import torch
from fairseq.data import Dictionary

PAD = "<PAD>_"
EOS = "<EOS>_"
UNK = "<UNK>_"
RESERVED_TOKENS = ["<Lua_Heritage>", PAD, EOS, UNK]
PAD_ID = RESERVED_TOKENS.index(PAD)  # Normally 1
EOS_ID = RESERVED_TOKENS.index(EOS)  # Normally 2
UNK_ID = RESERVED_TOKENS.index(UNK)  # Normally 3

This file has been truncated.

MMT encoding code (cloned from MMT):

sub_dict = SubwordDictionary.load("./engines/"+ENGINE+"/model.vcb")
indexes = sub_dict.encode_line(input_text, line_tokenizer=sub_dict.tokenize, add_if_not_exist=False)

Remark: encode_line is inherited from fairseq.data.Dictionary.

At this step, MMT directly builds a Tensor, but CT2 needs tokens. As a first attempt, I wanted to keep the exact MMT processing, so I need to convert this Tensor back to a token list:

indexes = indexes.long()
print("IDX="+str(indexes))
tokens = [sub_dict.symbols[idx] for idx in indexes]
print("TOK="+str(tokens))

At this step, the token list properly ends with the "<EOS>_" token.
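
A check along the following lines could confirm that programmatically rather than by reading the printed list. It is a minimal sketch: the function name is hypothetical, and it only assumes that the EOS symbol is the "<EOS>_" string defined in textencoder.py.

```python
EOS = "<EOS>_"


def check_source_tokens(tokens, eos=EOS):
    """Return a list of problems found in a source token list.

    An empty result means the list ends with exactly one EOS token.
    """
    if not tokens:
        return ["token list is empty"]
    problems = []
    if tokens[-1] != eos:
        problems.append("last token is not EOS")
    if eos in tokens[:-1]:
        problems.append("EOS appears before the last position")
    return problems


# Usage: check_source_tokens(tokens) should return [] before the
# tokens are passed to the translator.
```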

Then, I send it to CT2:

results = translator.translate_batch(
    [tokens],
    beam_size=5,
    # length_penalty=1,
    # coverage_penalty=0,
    # repetition_penalty=1,
)
print("TRANS=" + str(results[0].hypotheses[0]))
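
To measure how often the extra leading token shows up across a set of sentences, the hypotheses could be scanned with something like the sketch below. Both function names are hypothetical, and the code assumes that subword tokens carry a trailing "_" marker, as the reserved tokens in textencoder.py do; that assumption may not hold for every vocabulary entry.

```python
import unicodedata


def starts_with_punctuation(hypothesis):
    """Return True if the first token of a hypothesis is pure punctuation."""
    if not hypothesis:
        return False
    token = hypothesis[0]
    # Assumption: a single trailing "_" is a subword marker, not content.
    core = token[:-1] if token.endswith("_") else token
    if not core:
        return False
    return all(unicodedata.category(ch).startswith("P") for ch in core)


def leading_punctuation_rate(hypotheses):
    """Fraction of hypotheses whose first token is pure punctuation."""
    if not hypotheses:
        return 0.0
    flagged = sum(1 for hyp in hypotheses if starts_with_punctuation(hyp))
    return flagged / len(hypotheses)


# Usage: collect results[i].hypotheses[0] for each sentence and pass
# the resulting list of token lists to leading_punctuation_rate.
```

Running this on the same sentences before and after a configuration change gives a rough indication of whether the change had any effect. Sentences that legitimately start with punctuation are counted too, so the rate is only meaningful as a comparison.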