The M2M model was trained on true-cased text, so there is no need to lowercase the text.
Regarding subwording, if you use the UI, you just have to select the SentencePiece model found in the same folder as the M2M download and pass regular text. If you would rather use M2M yourself, you will have to use this SentencePiece model to subword the source text first.
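If you want to try the subwording step on its own first, here is a minimal sketch using the SentencePiece Python package; the model path is just an assumption pointing at the spm.128k.model shipped with M2M-100, and the full script later in this thread shows the complete workflow.

import sentencepiece as spm

# [Assumption] path to the SentencePiece model shipped with M2M-100
sp = spm.SentencePieceProcessor()
sp.load("m2m100_ct2/spm.128k.model")

# Subword a regular sentence into SentencePiece tokens
tokens = sp.encode("Hello world!", out_type=str)
print(tokens)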
Are these results with the UI or with M2M-100 CTranslate2 models?
If you are trying the M2M-100 CTranslate2 models, please make sure you add both the source prefix and the target prefix, i.e. the language codes (e.g. "__en__" and "__ja__").
Currently, the UI adds only the target prefix. I will modify it soon to run source language detection and add the source prefix.
M2M-100 uses language codes like "__en__", "__ja__", "__fr__", etc.
I will let you know when I adjust the GUI. For now, if you would like to try the M2M-100 models, you can use the following code. There is no need to adjust the source or target, as the code below takes care of this. Just make sure you modify the first four lines to set the file paths. If you use a different language pair/direction, change src_prefix and tgt_prefix as well.
import ctranslate2
import sentencepiece as spm
# [Modify] Set file paths and language prefixes
source_file_path = "source_test.en"
target_file_path = "target_test.ja"
sp_model_path = "m2m100_ct2/spm.128k.model"
ct_model_path = "m2m100_ct2/"
src_prefix = "__en__"
tgt_prefix = "__ja__"
# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)
# Open the source file
with open(source_file_path, "r") as source:
    lines = source.readlines()
source_sents = [line.strip() for line in lines]
target_prefix = [[tgt_prefix]] * len(source_sents)
# Subword the source sentences
source_sents_subworded = sp.encode(source_sents, out_type=str)
source_sents_subworded = [[src_prefix] + sent for sent in source_sents_subworded]
print("First sentence:", source_sents_subworded[0])
# Translate the source sentences
translator = ctranslate2.Translator(ct_model_path, device="cpu") # or "cuda" for GPU
translations = translator.translate_batch(source_sents_subworded, batch_type="tokens", max_batch_size=2024, target_prefix=target_prefix)
translations = [translation[0]['tokens'] for translation in translations]
# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_prefix):] for sent in translations_desubword]
print("First translation:", translations_desubword[0])
# Save the translations to a file
with open(target_file_path, "w+", encoding="utf-8") as target:
    for line in translations_desubword:
        target.write(line.strip() + "\n")
print("Done! Target file saved at:", target_file_path)
Thanks, Matthew! I think this was with the default beam size (2). If you try beam_size=5 with the M2M 418M or 1.2B model, you might get an even better BLEU score.
Although beam sizes 3 and 5 give the same BLEU score, running diff on the two target files shows they are not the same at all. Obviously, this would require human evaluation.
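If you want to reproduce this comparison, here is a hedged sketch reusing the translator, subworded sentences, and target prefixes from the script above; beam_size is the only new argument.

# Translate the same source with two beam sizes and compare the first hypotheses
for beam in (3, 5):
    results = translator.translate_batch(source_sents_subworded, batch_type="tokens", max_batch_size=2024, target_prefix=target_prefix, beam_size=beam)
    hypotheses = [result[0]["tokens"] for result in results]
    print("beam_size =", beam, "| first translation:", sp.decode(hypotheses[0]))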
By the way, this is what I was talking about: my model produces semantically similar sentences, but leans towards Japanese verb forms rather than those derived from Chinese, as in the WMT reference.
Thanks, Matthew! I understand your point, as illustrated by your screenshot. Japanese is a sophisticated language. Maybe it can benefit from some semantic evaluation metric.
As for the French-to-English model, you can find a recent version in the CTranslate2 format here.
As for the English-to-Arabic model, this was an experimental model trained on about 400k segments from MS Terminology. It used RNN-LSTM, not the Transformer model, so it cannot be converted to the CTranslate2 format.
For training an English-to-Arabic model, I would recommend using enough data from OPUS (perhaps avoiding crawled corpora) and applying the Transformer architecture. I am working on a new English-to-Arabic model, and I can publish it once it is finished.
Domain Adaptation
For Domain Adaptation, i.e. to create specialized models, one needs a good baseline model trained on enough (general) data, and then fine-tunes it on in-domain data. This is because in-domain data is usually limited and might not be enough to train a strong model from scratch. There are multiple approaches to Domain Adaptation. For example, I explained Mixed Fine-tuning (Chu et al., 2017) in this blog.
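To make the data side of Mixed Fine-tuning concrete, here is a rough sketch (all file names are placeholders) that oversamples the small in-domain corpus, mixes it with the generic corpus, and shuffles both sides in sync; the mixed files would then be used to continue training the baseline model.

import random

# [Assumption] placeholder parallel files, one segment per line
with open("indomain.en", encoding="utf-8") as s, open("indomain.ar", encoding="utf-8") as t:
    in_domain = list(zip(s.readlines(), t.readlines()))
with open("generic.en", encoding="utf-8") as s, open("generic.ar", encoding="utf-8") as t:
    generic = list(zip(s.readlines(), t.readlines()))

# Oversample the in-domain data so it is not drowned out by the generic data
factor = max(1, len(generic) // max(1, len(in_domain)))
mixed = generic + in_domain * factor

# Shuffle source and target together to keep the segments aligned
random.seed(1)
random.shuffle(mixed)

with open("mixed.en", "w", encoding="utf-8") as s, open("mixed.ar", "w", encoding="utf-8") as t:
    for src, tgt in mixed:
        s.write(src)
        t.write(tgt)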
Pre-trained Models
Nowadays, you can find a lot of pre-trained models. Obviously, not all of them are of good quality, but you can try.
M2M-100 model supports 100 languages, including Arabic. You can find a CTranslate2 version of it that you can use in DesktopTranslator here.
Argos Translate models: Argos Translate is another good tool, and it also supports CTranslate2 models. You can download the model you want from the list of models, change the extension to .zip, and extract it. Inside, you will find the CTranslate2 model and the SentencePiece model, which you can use in DesktopTranslator as well.
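Since the downloaded package is just a zip archive, here is a small sketch of extracting it with Python and listing its contents so you can locate the CTranslate2 folder and the SentencePiece model; the package file name is a placeholder, and the exact folder layout may differ per package.

import os
import zipfile

# [Assumption] placeholder name of a downloaded Argos Translate package (a regular zip archive)
package_path = "translate-en_ar.argosmodel"
extract_dir = "argos_en_ar"

with zipfile.ZipFile(package_path) as archive:
    archive.extractall(extract_dir)

# List the extracted files to find the CTranslate2 model folder and the SentencePiece model
for root, dirs, files in os.walk(extract_dir):
    for name in files:
        print(os.path.join(root, name))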