M2m100 sentences cut off during translation

Jourdelune · March 15, 2022, 9:00pm

Hi, I’ve been using M2M-100 for some time and I’ve noticed that some sentences are not fully translated: parts of them are cut off. For example, take this input:

So I used to use RITA because it allowed me to translate one language from say an English channel to a Spanish channel and vice versa. Would that be possible with this bot? I am trying to find an alternative now that RITA went kaput.

The model returns this translation:

Donc j'ai utilisé RITA parce que cela m'a permis de traduire une langue de dire un canal anglais à un canal espagnol et vice versa.

However, if I replace the first period with a comma, the model translates everything up to and including the question, but it still drops the final sentence:

So I used to use RITA because it allowed me to translate one language from say an English channel to a Spanish channel and vice versa, Would that be possible with this bot? I am trying to find an alternative now that RITA went kaput.

The model returns this translation:

Donc j'ai utilisé RITA parce que cela m'a permis de traduire une langue de dire un canal anglais à un canal espagnol et vice versa, Est-ce possible avec ce bot?

Note that this only happens for some inputs; most are translated correctly. I looked through all the settings and didn’t find one that fixes the problem. Here is the code to reproduce it:

import os
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["CT2_USE_EXPERIMENTAL_PACKED_GEMM"] = "1"

import ctranslate2
import sentencepiece as spm
import time

translator = ctranslate2.Translator("end_output", device="cpu", inter_threads=1, intra_threads=1, compute_type="auto")
s = spm.SentencePieceProcessor(model_file='spm.128k.model')

# Tokenize the input and prepend the source language token.
string = input("Entrez un texte: ")
string = ["__en__"] + s.encode(string, out_type=str)

# Untimed first call (acts as a warm-up before the timed runs).
value = translator.translate_batch(
    [string],
    target_prefix=[["__fr__"]],
)

print("start")
for i in range(5):
    time1 = time.time()
    value = translator.translate_batch(
        [string],
        target_prefix=[["__fr__"]],
        return_scores=False,
        max_decoding_length=2000,
        max_input_length=0,  # 0 disables input truncation
    )

    # Drop the leading target language token before decoding.
    print(s.decode(value[0].hypotheses[0][1:]))
    print(time.time() - time1)

Note that the same problem occurs with the Hugging Face implementation, so it is not specific to CTranslate2.

Thanks in advance for your answer.

ArtanisTheOne · March 15, 2022, 9:48pm

Adding more context:

I changed the translate_batch arguments to increase the max length and the max decoding length, and even requested 10 alternative hypotheses to see whether any of them would include the last sentence.

The last sentence was never translated, not even once. It is encoded correctly, and texts even longer than this one do translate.

But this one sentence just refuses to translate (M2M 1.2B CT2 model).

And if there is no way around this, would it be viable to split a message’s text into chunks and translate them one by one?

guillaumekln · March 16, 2022, 8:47am

Hi,

Like most models, M2M-100 is trained on sentences, not paragraphs, so paragraphs must be split into sentences before they are passed to the model.

This step is currently outside the scope of the CTranslate2 project. You can use the NLTK sentence splitter, for example; see the nltk.tokenize package.
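
As a rough illustration of the split-then-translate approach, here is a minimal sketch. The helper names (split_sentences, translate_paragraph, make_ct2_backend) are hypothetical and not part of CTranslate2 or NLTK. The regex splitter is only a naive stand-in for a proper sentence splitter such as the one in nltk.tokenize; it will mis-split abbreviations like "e.g." or "Dr.". The CTranslate2 backend assumes the same translator and SentencePiece objects as in the first post.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace.

    Stand-in for a real sentence splitter (e.g. from nltk.tokenize).
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_paragraph(
    text: str, translate_sentences: Callable[[List[str]], List[str]]
) -> str:
    """Split a paragraph, translate each sentence, and join the results."""
    sentences = split_sentences(text)
    return " ".join(translate_sentences(sentences))


def make_ct2_backend(translator, sp, src_lang="__en__", tgt_lang="__fr__"):
    """Build a sentence-level backend from the objects used in the first post.

    Assumes `translator` is a ctranslate2.Translator for M2M-100 and `sp` is
    the matching sentencepiece.SentencePieceProcessor.
    """

    def translate_sentences(sentences: List[str]) -> List[str]:
        # One token list per sentence, each prefixed with the source language.
        batch = [[src_lang] + sp.encode(s, out_type=str) for s in sentences]
        results = translator.translate_batch(
            batch,
            target_prefix=[[tgt_lang]] * len(batch),
        )
        # Drop the leading target language token before decoding.
        return [sp.decode(r.hypotheses[0][1:]) for r in results]

    return translate_sentences


if __name__ == "__main__":
    text = (
        "So I used to use RITA because it allowed me to translate one language "
        "from say an English channel to a Spanish channel and vice versa. "
        "Would that be possible with this bot? "
        "I am trying to find an alternative now that RITA went kaput."
    )
    # Dummy backend so the sketch runs without a model; swap in
    # make_ct2_backend(translator, s) to use the real translator.
    dummy_backend = lambda sentences: [s.upper() for s in sentences]
    print(split_sentences(text))
    print(translate_paragraph(text, dummy_backend))
```

Passing all sentences in a single translate_batch call lets the translator batch them together, and each sentence is decoded independently, so a sentence-final period no longer ends the whole output.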