The Helsinki-NLP/opus-mt-mul-en model returns <pad> with CTranslate2

Jourdelune · July 5, 2022, 4:44pm

Hello, I’m trying to use the multilingual Opus-MT model, but I noticed that with CTranslate2 it returns only <pad> tokens for every translation when I use the mul-en direction. In the other direction (en-mul) everything works correctly. Here is an example:

import ctranslate2
import transformers

translator = ctranslate2.Translator("/home/jourdelune/opus-mt/")
tokenizer = transformers.AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-mul-en")

source = tokenizer.convert_ids_to_tokens(tokenizer.encode(">>fra<<Bonjour le monde"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))

output:

<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>

The same model only works when run directly with Hugging Face Transformers:

from transformers import MarianMTModel, MarianTokenizer

src_text = [
    ">>fra<<Bonjour le monde",
]

model_name = "Helsinki-NLP/opus-mt-mul-en"
tokenizer = MarianTokenizer.from_pretrained(model_name)


model = MarianMTModel.from_pretrained(model_name)
translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
print([tokenizer.decode(t, skip_special_tokens=True) for t in translated])

output:

['Welcome to the World']

I converted the model with this command:
ct2-transformers-converter --model Helsinki-NLP/opus-mt-mul-en --output_dir opus-mt

The strange thing is that translation works with the model for the opposite direction (en-mul), which I converted with this command:

ct2-transformers-converter --model Helsinki-NLP/opus-mt-en-mul --output_dir opus-mt-2

If anyone has an idea why I’m seeing this bug, please let me know!

guillaumekln · July 5, 2022, 5:36pm

Hi,

Thank you for reporting this issue. I can reproduce it.

We should actually remove this <pad> token that was added by Transformers and is not used by the original Opus-MT model. This token is only used to start the decoder from a zero embedding, but we have a dedicated option for that.
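To illustrate the idea, here is a minimal sketch of dropping a trailing <pad> entry from a vocabulary together with its embedding row. It uses toy data and plain Python lists; the function name `remove_trailing_pad` is hypothetical and this is not the actual converter code. It also assumes that the <pad> token, when present, is the last entry of the vocabulary.

```python
def remove_trailing_pad(tokens, embeddings, pad_token="<pad>"):
    """Return copies of the vocabulary and embedding rows without a trailing pad token.

    Assumption: the pad token, if present, is the last vocabulary entry.
    """
    if len(tokens) != len(embeddings):
        raise ValueError("vocabulary and embedding table must have the same length")
    if tokens and tokens[-1] == pad_token:
        # Drop the token and its embedding row so it can no longer be generated.
        return tokens[:-1], embeddings[:-1]
    # Nothing to remove: return unchanged copies.
    return list(tokens), list(embeddings)


# Toy data for illustration only (not the real Opus-MT vocabulary).
toy_tokens = ["</s>", "<unk>", "▁Hello", "▁world", "<pad>"]
toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.0, 0.0]]

new_tokens, new_embeddings = remove_trailing_pad(toy_tokens, toy_embeddings)
print(new_tokens)
print(len(new_embeddings))
```

Once the token is gone from the output vocabulary, the decoder has no way to emit it, and the zero start embedding has to come from a separate mechanism instead.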

I will fix the issue tomorrow.

Note that you should not use the language token >>fra<< for this model, as stated in the Transformers documentation:

Note that if a model is only multilingual on the source side, like Helsinki-NLP/opus-mt-roa-en, no language codes are required.
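For existing inputs that already carry a language prefix, the prefix can be stripped before tokenization. A minimal sketch, where the helper name `strip_language_token` is hypothetical and the pattern assumes the `>>xxx<<` form used in the example above:

```python
import re

# Matches a leading language token such as ">>fra<<", plus any spaces after it.
_LANG_TOKEN = re.compile(r"^\s*>>[^<>\s]+<<\s*")


def strip_language_token(text):
    """Remove a leading >>xxx<< language token, if the text has one."""
    return _LANG_TOKEN.sub("", text, count=1)


print(strip_language_token(">>fra<<Bonjour le monde"))
```

The cleaned string can then be passed to the tokenizer exactly as in the original example.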

Jourdelune · July 5, 2022, 5:41pm

Thank you very much! Yes, indeed, the token at the beginning of the sentence is not needed. Always read the docs to the end, haha. Thanks for pointing it out.

guillaumekln · July 6, 2022, 8:22am

github.com/OpenNMT/CTranslate2

Remove <pad> token when converting MarianMT models

OpenNMT:master ← guillaumekln:rm-pad-token-marianmt

opened 08:22AM - 06 Jul 22 UTC

guillaumekln

+53 -1

This token is added by Hugging Face's Transformers to start the decoder from a zero embedding, but we already have a dedicated internal option `start_from_zero_embedding` for this purpose. We remove this token to match the original Marian vocabulary and also prevent the token from being generated (see [this issue](https://forum.opennmt.net/t/the-helsinki-nlp-opus-mt-mul-en-template-cross-references-pad-with-ctranslate2/4968) on the forum).

tel34 · July 6, 2022, 8:53am

Thanks for this. Is a new ct2-transformers-converter script available for download?

guillaumekln · July 6, 2022, 8:56am

Not yet. You can watch the GitHub repository to be notified when we release a new version.