OpenNMT

DesktopTranslator: Windows GUI Executable based on CTranslate2

Hi Colleagues!

DesktopTranslator is a cross-platform translation GUI, written in Python and based on CTranslate2. The app has been tested on Windows and macOS.

You can also download a Windows executable installer and a macOS app for DesktopTranslator.

Thanks to Guillaume @guillaumekln and the CTranslate2 developers for supporting Windows.

The example is in Python, yet the performance is reasonable. I expect a C++ GUI or API could achieve even better performance.

Kind regards,
Yasmin

6 Likes

DesktopTranslator now supports M2M-100 multilingual models through CTranslate2.

4 Likes

Awesome to see the M2M integration, testing it now.

Should we pass in lowercased or SentencePiece-encoded text for best results, or is that handled during inference?

Hi Matthew!

Thanks!

  • The M2M model was trained on true-cased text, so there is no need to lowercase the input.
  • Regarding subwording, if you use the UI, you just have to select the SentencePiece model found in the same folder as the M2M download, and pass in regular text. If you would rather use M2M yourself, you will have to use this SentencePiece model to subword the source text first.

I hope this answers your questions.

Kind regards,
Yasmin

1 Like

Some tests on the WMT 2020 JP set with SacreBLEU.

M2M 3 Beam: 18.6
M2M 5 Beam: 18.5
My best model (pre-ensemble): 19.1
Google: 24.4

Thanks for sharing, Matthew!

Are these results with the UI or with M2M-100 CTranslate2 models?

If you are trying the M2M-100 CTranslate2 models directly, please make sure you add both a source prefix and a target prefix with the language codes (e.g. "__en__" and "__ja__").

Currently, the UI adds only the target prefix. I will modify it to run source language detection, and add the source prefix soon.
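
To make the two prefixes concrete, here is a minimal sketch at the token level. It assumes the sentences are already split into SentencePiece pieces; the helper names `lang_token` and `add_language_prefixes` are hypothetical and are not part of CTranslate2 or the UI.

```python
def lang_token(lang_code):
    """Build an M2M-100 style language token, e.g. "en" -> "__en__"."""
    return f"__{lang_code}__"


def add_language_prefixes(tokenized_sents, src_lang, tgt_lang):
    """Prepend the source language token to every sentence and
    build one target prefix (a one-token list) per sentence."""
    src = lang_token(src_lang)
    tgt = lang_token(tgt_lang)
    sources = [[src] + list(tokens) for tokens in tokenized_sents]
    target_prefixes = [[tgt] for _ in tokenized_sents]
    return sources, target_prefixes


# Hypothetical, already-subworded input
sources, target_prefixes = add_language_prefixes([["▁Hello", "▁world"]], "en", "ja")
print(sources)          # [['__en__', '▁Hello', '▁world']]
print(target_prefixes)  # [['__ja__']]
```

The source token travels with the source sentence, while the target token is passed separately as the target prefix for decoding.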

Kind regards,
Yasmin

Sure thing.

With the Desktop GUI and the 1.2B CTranslate2 model. OK, I will check how adding source tokens helps. What is the correct formatting?

“<en>English sentence<en>”
“<jp>日本語<jp>”

M2M-100 uses language codes like "__en__", "__ja__", "__fr__", etc.

I will let you know when I adjust the GUI. For now, if you would like to try the M2M-100 models, you can use the following code. There is no need to edit the source or target text, as the code below adds the prefixes for you. Just make sure you modify the first 4 lines for the file paths. If you use a different language pair or direction, change src_prefix and tgt_prefix as well.

import ctranslate2
import sentencepiece as spm


# [Modify] Set file paths and language prefixes
source_file_path = "source_test.en"
target_file_path = "target_test.ja"

sp_model_path = "m2m100_ct2/spm.128k.model"
ct_model_path = "m2m100_ct2/"

src_prefix = "__en__"
tgt_prefix = "__ja__"



# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Open the source file
with open(source_file_path, "r", encoding="utf-8") as source:
  lines = source.readlines()

source_sents = [line.strip() for line in lines]
target_prefix = [[tgt_prefix]] * len(source_sents)

# Subword the source sentences
source_sents_subworded = sp.encode(source_sents, out_type="str")
source_sents_subworded = [[src_prefix] + sent for sent in source_sents_subworded]
print("First sentence:", source_sents_subworded[0])

# Translate the source sentences
translator = ctranslate2.Translator(ct_model_path, device="cpu")  # or "cuda" for GPU
translations = translator.translate_batch(source_sents_subworded, batch_type="tokens", max_batch_size=2024, target_prefix=target_prefix)
# Keep the best hypothesis; older CTranslate2 releases used translation[0]['tokens'] instead
translations = [translation.hypotheses[0] for translation in translations]

# Desubword the target sentences
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_prefix):] for sent in translations_desubword]
print("First translation:", translations_desubword[0])

# Save the translations to a file
with open(target_file_path, "w+", encoding="utf-8") as target:
  for line in translations_desubword:
    target.write(line.strip() + "\n")

print("Done! Target file saved at:", target_file_path)

Including source language tags, it got 20.5.

1 Like

Thanks, Matthew! I think this was with the default beam size (2). If you try beam_size=5 with M2M 418M or 1.2B, you might get an even better BLEU score.
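
A minimal sketch of how the beam size could be changed. It assumes a `ctranslate2.Translator` is already loaded, and a CTranslate2 release whose results expose `hypotheses`. The wrapper name `translate_with_beam` is hypothetical.

```python
def translate_with_beam(translator, source_tokens, target_prefix, beam_size=5):
    """Translate pre-tokenized sentences with the given beam size and
    return the best hypothesis (a list of tokens) for each sentence."""
    results = translator.translate_batch(
        source_tokens,
        target_prefix=target_prefix,
        beam_size=beam_size,  # CTranslate2 falls back to 2 when this is omitted
    )
    return [result.hypotheses[0] for result in results]
```

The `translator` argument can be the same object created in the script above, with the subworded source sentences and target prefixes passed in unchanged.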

Kind regards,
Yasmin

Test Dataset: 2011 segments after removing duplicates: [English] [Japanese] (*)

(*) Original source: CourseraParallelCorpus - Test (Human-validated)


M2M-100 418M-parameter model

Beam Size: 5 / 3
BLEU: 24.8

Beam Size: 2
BLEU: 24.6


M2M-100 1.2B-parameter model

Beam Size: 5 / 3
BLEU: 26.4

Beam Size: 2
BLEU: 26.1


Although beam sizes 3 and 5 give the same BLEU score, running diff on the two target files shows that the outputs are far from identical. Deciding which one is better would require human evaluation.
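
For anyone who wants to quantify this, here is a minimal sketch that counts how many segments differ between two translation outputs. The function names and the toy sentences are made up for illustration, and it assumes both outputs have the same number of lines (extra lines in the longer one are ignored).

```python
def read_lines(path):
    """Read a UTF-8 text file into a list of lines without newlines."""
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()


def count_differing_lines(lines_a, lines_b):
    """Return (number of differing line pairs, number of pairs compared)."""
    pairs = list(zip(lines_a, lines_b))
    differing = sum(1 for a, b in pairs if a.strip() != b.strip())
    return differing, len(pairs)


# Toy example; in practice, load both target files with read_lines()
beam3 = ["これはテストです。", "ありがとう。"]
beam5 = ["これはテストです。", "ありがとうございます。"]
differing, total = count_differing_lines(beam3, beam5)
print(f"{differing} of {total} segments differ")  # 1 of 2 segments differ
```

This only says how many segments changed, not which version is better.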

Kind regards,
Yasmin

1 Like

By the way, this is what I was talking about: my model produces semantically similar sentences, but leans towards Japanese verb forms rather than those derived from Chinese, like in the WMT reference.

This is something that BLEU penalizes unfairly.

1 Like

Thanks, Matthew! I understand your point, as illustrated by your screenshot. Japanese is a sophisticated language. Maybe it can benefit from some semantic evaluation metric.

1 Like

Fixed

  • Source language code as source prefix token :roll_eyes:
  • Sentence splitting for non-Latin languages

Added

  • Source language detection
  • macOS executable app :point_up:

3 Likes

Thank you, Yasmin.
How can I add the NMT pre-trained models to DesktopTranslator?
https://www.machinetranslation.io/nmt-pretrained-models

Dear Muhammad,

Models need to be in the CTranslate2 format.

As for the French-to-English model, you can find a recent version in the CTranslate2 format here.

As for the English-to-Arabic model, this was an experimental model trained on about 400k segments from MS Terminology. It used RNN-LSTM, not the Transformer model, so it cannot be converted to the CTranslate2 format.

For training an English-to-Arabic model, I would recommend using enough data from OPUS (maybe, avoid crawled corpora), and applying the Transformer model. I am working on a new English-to-Arabic model, and I can publish it once it is finished.


Domain Adaptation

For Domain Adaptation, i.e. creating specialized models, one needs a good baseline model trained on enough (general) data, which is then fine-tuned on in-domain data. This is because in-domain data is usually scarce, and might not be enough to train a strong model from scratch. There are multiple approaches to Domain Adaptation. For example, I explained Mixed Fine-tuning (Chu et al., 2017) in this blog.
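
As a rough illustration of the data side of Mixed Fine-tuning, the sketch below oversamples the in-domain pairs to roughly match the size of the general data and shuffles the two together. The function name, the toy data, and the 1:1 target ratio are assumptions for illustration only; the actual recipe and ratio depend on your setup.

```python
import random


def build_mixed_corpus(general_pairs, in_domain_pairs, seed=0):
    """Oversample in-domain sentence pairs up to the size of the general
    data, then shuffle both sets together into one fine-tuning corpus."""
    general_pairs = list(general_pairs)
    in_domain_pairs = list(in_domain_pairs)
    if not in_domain_pairs:
        raise ValueError("in_domain_pairs must not be empty")
    # Target size for the in-domain part: never smaller than the original set
    target_size = max(len(general_pairs), len(in_domain_pairs))
    repeats = -(-target_size // len(in_domain_pairs))  # ceiling division
    oversampled = (in_domain_pairs * repeats)[:target_size]
    mixed = general_pairs + oversampled
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy (source, target) pairs
general = [(f"general source {i}", f"general target {i}") for i in range(6)]
in_domain = [("domain source 0", "domain target 0"),
             ("domain source 1", "domain target 1")]
mixed = build_mixed_corpus(general, in_domain)
print(len(mixed))  # 12
```

The baseline model trained on general data would then continue training on the mixed corpus.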


Pre-trained Models

Nowadays, you can find a lot of pre-trained models. Obviously, not all of them are of good quality, but you can try them.

  • The M2M-100 model supports 100 languages, including Arabic. You can find a CTranslate2 version of it that you can use in DesktopTranslator here.
  • Argos Translate models: Argos Translate is another good tool. It also supports CTranslate2 models, so you can download the model you want from the list of models. Then, change the extension to zip and extract it. You will find the CTranslate2 model and the SentencePiece model, which you can use in DesktopTranslator as well.
  • Hugging Face models. However, these are most likely meant to be used with the transformers library.
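
The Argos Translate step can also be scripted. This is a minimal sketch, assuming the downloaded package is a plain zip archive as described in the bullet above; the function name and the file name in the usage comment are hypothetical. Python's `zipfile` does not care about the file extension, so no renaming is needed.

```python
import zipfile
from pathlib import Path


def extract_argos_package(package_path, output_dir):
    """Extract a downloaded Argos Translate package (a zip archive)
    and return the paths of the extracted entries."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(package_path) as archive:
        archive.extractall(output_dir)
        names = archive.namelist()
    return [output_dir / name for name in names]


# Usage (hypothetical file name):
# for path in extract_argos_package("downloaded_model.argosmodel", "argos_model"):
#     print(path)
```

After extraction, point DesktopTranslator at the CTranslate2 model folder and the SentencePiece model found inside.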

I hope this helps. If you have more questions, please let me know.

Kind regards,
Yasmin

You may also be interested in the latest CTranslate2 version which added a converter for the 1000+ pretrained models from OPUS-MT. See the “Marian” example in the quickstart.

2 Likes

This is great news. Thanks a lot, Guillaume! I see you also added support for mBART.

This is good and timely news for me. Thanks :slight_smile:

Thank you for your wonderful work!

I have a GPU and use your DesktopTranslator on Windows 10. I want to use CTranslate2 with the GPU, so I changed your code as follows:

self.translator = ctranslate2.Translator(
    self.model_dir,
    device="gpu"
)

It doesn’t work.

Does CTranslate2 support GPU on Windows?

Thanks!