The M2M model was trained on true-cased text, so there is no need to lowercase the text.
Regarding subwording, if you use the UI, you just have to select the SentencePiece model found in the same folder as the M2M download, and pass regular text. If you would rather use M2M yourself, you will have to use this SentencePiece model to subword the source text first.
M2M-100 uses language codes like "__en__", "__ja__", "__fr__", etc.
I will let you know when I adjust the GUI. For now, if you would like to try M2M-100 models, you can use the following code. No need to adjust the source or target text, as the code below takes care of this. Just make sure you modify the first four lines for file paths. If you use a different language pair/direction, change src_prefix and tgt_prefix as well.
import ctranslate2
import sentencepiece as spm

# [Modify] Set file paths and language prefixes
source_file_path = "source_test.en"
target_file_path = "target_test.ja"
sp_model_path = "m2m100_ct2/spm.128k.model"
ct_model_path = "m2m100_ct2/"
src_prefix = "__en__"
tgt_prefix = "__ja__"

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Open the source file
with open(source_file_path, "r", encoding="utf-8") as source:
    lines = source.readlines()
source_sents = [line.strip() for line in lines]
target_prefix = [[tgt_prefix]] * len(source_sents)

# Subword the source sentences and add the source language prefix
source_sents_subworded = sp.encode(source_sents, out_type="str")
source_sents_subworded = [[src_prefix] + sent for sent in source_sents_subworded]
print("First sentence:", source_sents_subworded[0])

# Translate the source sentences
translator = ctranslate2.Translator(ct_model_path, device="cpu")  # or "cuda" for GPU
translations = translator.translate_batch(source_sents_subworded, batch_type="tokens", max_batch_size=2024, target_prefix=target_prefix)
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences and remove the target language prefix
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_prefix):] for sent in translations_desubword]
print("First translation:", translations_desubword[0])

# Save the translations to a file
with open(target_file_path, "w+", encoding="utf-8") as target:
    for line in translations_desubword:
        target.write(line.strip() + "\n")

print("Done! Target file saved at:", target_file_path)
By the way, this is what I meant about how my model produces semantically similar sentences, but leans towards Japanese verb forms rather than those derived from Chinese, as in the WMT reference.
As for the French-to-English model, you can find a recent version in the CTranslate2 format here.
As for the English-to-Arabic model, this was an experimental model trained on about 400k segments from MS Terminology. It used an RNN-LSTM architecture, not the Transformer, so it cannot be converted to the CTranslate2 format.
For training an English-to-Arabic model, I would recommend using enough data from OPUS (though maybe avoid crawled corpora), and applying the Transformer model. I am working on a new English-to-Arabic model, and I can publish it once it is finished.
For Domain Adaptation, i.e. to create specialized models, one needs a good baseline model trained on enough (general) data, and then fine-tunes it on in-domain data. This is because in-domain data is usually smaller, and might not be enough to train a strong model from scratch. There are multiple approaches to Domain Adaptation. For example, I explained Mixed Fine-tuning (Chu et al., 2017) in this blog.
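To illustrate the idea behind Mixed Fine-tuning, here is a minimal sketch of the data-mixing step only: the small in-domain corpus is oversampled and shuffled together with a sample of the general-domain data before fine-tuning. The sentence pairs and the oversampling factor are toy examples, not values from my experiments.

```python
import random

def mix_corpora(in_domain, general, oversample=5, seed=42):
    """Mixed fine-tuning data step: oversample the small in-domain
    corpus, concatenate it with general-domain data, and shuffle,
    so each mini-batch sees both domains during fine-tuning."""
    mixed = in_domain * oversample + general
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy example with (source, target) sentence pairs
in_domain = [("patient history", "tarikh almarid")]
general = [("hello", "marhaba"), ("thank you", "shukran")]

mixed = mix_corpora(in_domain, general, oversample=2)
print(len(mixed))  # 4
```

In practice, you would write the mixed corpus back to parallel text files and continue training the baseline model on it with your usual toolkit.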
Nowadays, you can find a lot of pre-trained models. Obviously, not all of them are of good quality, but you can try.
The M2M-100 model supports 100 languages, including Arabic. You can find a CTranslate2 version of it that you can use in DesktopTranslator here.
Argos Translate models: Argos Translate is another good tool. It also supports CTranslate2 models. So you can download the model you want from the list of models. Then, change the extension to zip and extract it. You will find the CTranslate2 model and the SentencePiece model, which you can use in DesktopTranslator as well.
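Since the Argos Translate package is just a zip archive, you can also extract it programmatically instead of renaming it by hand. A small sketch, assuming the downloaded file is a valid zip; the function name and output folder are my own choices, not part of Argos Translate:

```python
import zipfile
from pathlib import Path

def extract_argos_model(model_path, out_dir):
    """Extract an Argos Translate package (a zip archive) so the
    CTranslate2 and SentencePiece models inside become accessible."""
    with zipfile.ZipFile(model_path) as zf:
        zf.extractall(out_dir)
    # Return the extracted file names for inspection
    return sorted(p.name for p in Path(out_dir).rglob("*"))
```

After extraction, point DesktopTranslator at the CTranslate2 model folder and the SentencePiece model file found inside.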