I was able to get M2M-100 working with CTranslate2 and have been trying to train a similar multilingual model from scratch using OpenNMT-py.
What is the best format to use for the tokens that tell the model what the source and target languages are? For M2M-100 I prepended the source language token to the source text and then called ctranslate2.Translator.translate_batch with target_prefix=[[target_code_token]] * len(tokenized_sentences). Another option is to prepend the target code token to the source text, as in this tutorial.
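For reference, this is roughly what my M2M-100 setup looks like (a simplified sketch; the model path, SentencePiece model, and example sentence are placeholders, and I'm assuming the returned hypotheses include the prefix token):

import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("m2m100_ct2/")
sp = spm.SentencePieceProcessor(model_file="m2m100_ct2/sentencepiece.model")

sentences = ["William Caxton was an English merchant, diplomat and writer."]

# Prepend the source language token to each tokenized source sentence.
source_tokens = [["__en__"] + sp.encode(s, out_type=str) for s in sentences]

# Pass the target language token as the decoding prefix.
results = translator.translate_batch(
    source_tokens,
    target_prefix=[["__fr__"]] * len(source_tokens),
)

for result in results:
    tokens = result.hypotheses[0]
    # Drop the prefix token before detokenizing.
    print(sp.decode(tokens[1:]))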
M2M-100 format:
__en__William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer.
__fr__William Caxton (c. 1422 – c. 1491) était un marchand, diplomate et écrivain anglais.
Prepend the target code to the source text:
__fr__William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer.
William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer.
Prepend the source and target code to the source text:
__en__ __fr__William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer.
William Caxton (c. 1422 – c. 1491) was an English merchant, diplomat and writer.
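To make the third option concrete, this is roughly how I would build the training pairs (a sketch; the file names are placeholders for my Opus data, and the single space between the two codes is just my own choice, not a standard):

def make_pair(src_lang, tgt_lang, src_line, tgt_line):
    # Prepend the source and target language codes to the source side only.
    source = f"__{src_lang}__ __{tgt_lang}__{src_line.strip()}"
    return source, tgt_line.strip()

with open("opus.en-fr.en", encoding="utf-8") as src_file, \
     open("opus.en-fr.fr", encoding="utf-8") as tgt_file, \
     open("train.src", "w", encoding="utf-8") as out_src, \
     open("train.tgt", "w", encoding="utf-8") as out_tgt:
    for src_line, tgt_line in zip(src_file, tgt_file):
        source, target = make_pair("en", "fr", src_line, tgt_line)
        out_src.write(source + "\n")
        out_tgt.write(target + "\n")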
Is there any sort of industry standard for this? I think I prefer prepending the source and target code tokens to the source text, but I also want to maximize compatibility with models trained by other people.
Additionally, how does the target_prefix parameter work in CTranslate2? My understanding is that it forces the decoder to start with the provided prefix tokens and then continue decoding from there. The target_prefix parameter works with M2M-100 but doesn't seem to work with my OpenNMT-py model.
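This is how I picture it behaving, based on what I see with M2M-100 (so treat it as an assumption; the model path and tokens below are placeholders):

import ctranslate2

translator = ctranslate2.Translator("model_ct2/")
source = [["__en__", "▁Cheese"]]  # an already tokenized source sentence

# Without a prefix the decoder is free to pick its own first token.
free = translator.translate_batch(source)

# With target_prefix the decoder is forced to emit these tokens first and
# then continues decoding normally after them.
forced = translator.translate_batch(source, target_prefix=[["__fr__"]])

print(free[0].hypotheses[0])
print(forced[0].hypotheses[0])  # should start with "__fr__" if the prefix is applied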
I created a multilingual dataset with 95,949,184 lines of data from Opus and formatted it in the M2M-100 format. I then trained a model with OpenNMT-py the same way I would for an individual language pair. I ran the model with CTranslate2, prepending the source language token to the source text and passing the target language token with the target_prefix parameter. I get completely incorrect output, and the target_prefix doesn't seem to affect the translation at all.
$ argos-translate -f en -t de "Cheese"
es ies.
$ argos-translate -f en -t fr "Cheese"
es ies.
$ argos-translate -f en -t es "Cheese"
es ies.
$ argos-translate -f en -t es "I'm flying to Miami next week."
Miami i Miami.
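One thing I still need to rule out (just a guess at the cause): whether the __xx__ codes actually exist as single tokens in the SentencePiece model and vocabulary I trained with, since if they get split into subword pieces the prefix would mean nothing to the model. Something like this (the model path is a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")

for token in ["__en__", "__fr__", "__es__", "__de__"]:
    pieces = sp.encode(token, out_type=str)
    # A usable language code should come back as a single piece,
    # not split into fragments like ["▁_", "_", "en", "__"].
    print(token, "->", pieces)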
Will most pretrained models just need custom logic? I know this is often true when running models on Hugging Face: different language models from different companies often need a custom tokenizer or other custom logic.