I converted a ModernMT (Fairseq) model to CT2. Everything works quite well, whatever quantization is used.
For information (I do not know if this is important): in my application, the source sentences contain extra information at the beginning, which the model should use to produce a tuned translation of the rest of the sentence. This works well with the original ModernMT model.
Problem: especially on short sentences, CT2 often produces an extra token at the beginning of the output (usually a punctuation mark).
I tried playing with the beam_size, length_penalty, and coverage_penalty parameters, without success.
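For reference, this is roughly how those decoding options are passed to CT2 (a minimal sketch; the model directory, the source tokens, and the values shown are placeholders, not my actual setup):

import ctranslate2

# Placeholder model directory produced by the Fairseq -> CT2 converter
translator = ctranslate2.Translator("ct2_model_dir", device="cpu")

# Placeholder sub-word tokens, same vocabulary as the converted model
source_tokens = ["▁Hello", "▁world", "</s>"]

results = translator.translate_batch(
    [source_tokens],
    beam_size=5,           # example value; I varied this
    length_penalty=1.0,    # example value; I varied this
    coverage_penalty=0.0,  # example value; I varied this
)
print(results[0].hypotheses[0])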
I wonder if this may be due to the fact that ModernMT does not seem to use a BOS token, only an EOS token.
Thus, I tried with this config.json file:
Do you have a lot of short sentences in your training/validation sets?
I experienced that in the past: the model was learning to produce outputs of a certain length because I was providing a lot of sentences with more or less the same length.
Remark: encode_line is inherited from fairseq.data.Dictionary.
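For context, a rough sketch of that step (the dictionary file name and the input line are assumptions; in MMT the dictionary is actually a subclass of fairseq.data.Dictionary):

from fairseq.data import Dictionary

sub_dict = Dictionary.load("dict.txt")  # placeholder path to the model vocabulary

# encode_line maps each sub-word piece to its index and, by default, appends the EOS index
indexes = sub_dict.encode_line("▁Hello ▁world", add_if_not_exist=False, append_eos=True)
print(indexes)  # torch.IntTensor of vocabulary indices ending with the EOS id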
At this step, MMT directly builds a Tensor, but CT2 needs tokens. In a first attempt, I wanted to keep the exact processing of MMT, so I need to convert this Tensor back to a token list:
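# "indexes" is the IntTensor built by encode_line; cast it to long and map each id back to its surface token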
indexes = indexes.long()
print("IDX="+str(indexes))
tokens = [sub_dict.symbols[idx] for idx in indexes]
print("TOK="+str(tokens))
At this step, the token list properly ends with the "<EOS>_" token.
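From there, the rebuilt token list can be passed directly to CT2 (same kind of call as in the earlier sketch; the model directory is again a placeholder):

import ctranslate2

translator = ctranslate2.Translator("ct2_model_dir")  # placeholder path
results = translator.translate_batch([tokens])        # "tokens" is the list rebuilt just above
print(results[0].hypotheses[0])                       # target sub-word pieces, where the extra leading token appears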