DesktopTranslator: Windows GUI executable based on CTranslate2

Test Dataset: 2011 segments after removing duplicates: [English] [Japanese] (*)

(*) Original source: CourseraParallelCorpus - Test (Human-validated)


M2M-100 418M-parameter model

Beam Size: 5 / 3
BLEU: 24.8

Beam Size: 2
BLEU: 24.6


M2M-100 1.2B-parameter model

Beam Size: 5 / 3
BLEU: 26.4

Beam Size: 2
BLEU: 26.1


Although beam sizes 3 and 5 give the same BLEU score, running diff on the two target files shows the outputs are far from identical. Obviously, a proper comparison would require human evaluation.
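To quantify how much two outputs differ despite an identical BLEU score, a quick segment-level comparison helps. A minimal sketch (the helper name diff_ratio is mine, not from DesktopTranslator; with real files, read each into a list of lines first):

```python
def diff_ratio(hyp_a, hyp_b):
    """Fraction of aligned segments that differ between two translation outputs."""
    assert len(hyp_a) == len(hyp_b), "outputs must have the same number of segments"
    changed = sum(a.strip() != b.strip() for a, b in zip(hyp_a, hyp_b))
    return changed / len(hyp_a)

# Toy example with three segment pairs, one of which differs.
beam5 = ["こんにちは世界", "良い一日を", "ありがとう"]
beam3 = ["こんにちは世界", "良い1日を", "ありがとう"]
print(f"{diff_ratio(beam5, beam3):.0%} of segments differ")  # prints "33% of segments differ"
```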

Kind regards,
Yasmin


By the way, this is what I was talking about: my model produces semantically similar sentences, but leans towards native Japanese verb forms rather than forms derived from Chinese, as in the WMT reference.

This is something that BLEU penalizes unfairly.


Thanks, Matthew! I understand your point, as illustrated by your screenshot. Japanese is a sophisticated language. Maybe it can benefit from some semantic evaluation metric.


Fixed

  • Source language code as source prefix token :roll_eyes:
  • Sentence splitting for non-Latin languages

Added

  • Source language detection
  • macOS executable app :point_up:
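For context on the sentence-splitting fix: splitters that key on Latin punctuation like ". " miss non-Latin sentence terminators entirely. A toy illustration of splitting on Japanese terminators (DesktopTranslator's actual implementation may differ):

```python
import re

def split_sentences_ja(text):
    """Split Japanese text after 。！？ while keeping each terminator attached."""
    parts = re.split(r"(?<=[。！？])", text)
    return [p for p in parts if p]  # drop the empty trailing piece

print(split_sentences_ja("こんにちは。元気ですか？はい！"))
# prints ['こんにちは。', '元気ですか？', 'はい！']
```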

Thank you, Yasmin!
How can I add NMT pre-trained models to DesktopTranslator?
https://www.machinetranslation.io/nmt-pretrained-models

Dear Muhammad,

Models need to be in the CTranslate2 format.

As for the French-to-English model, you can find a recent version in the CTranslate2 format here.

As for the English-to-Arabic model, this was an experimental model trained on about 400k segments from MS Terminology. It used RNN-LSTM, not the Transformer model, so it cannot be converted to the CTranslate2 format.

For training an English-to-Arabic model, I would recommend using enough data from OPUS (though maybe avoid crawled corpora) and applying the Transformer model. I am working on a new English-to-Arabic model, and I can publish it once it is finished.


Domain Adaptation

For Domain Adaptation, i.e. to create specialized models, one needs a good baseline model trained on enough (general) data, which is then fine-tuned on in-domain data. This is because in-domain data is usually limited and might not be enough to train a strong model from scratch. There are multiple approaches to Domain Adaptation. For example, I explained Mixed Fine-tuning (Chu et al., 2017) in this blog.
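As an illustration of the data side of Mixed Fine-tuning, the small in-domain corpus is typically oversampled to roughly match the general-domain data before continuing training on the mix. A minimal sketch (function and variable names are mine; see Chu et al., 2017 for the actual recipe):

```python
import random

def mixed_finetune_corpus(general, in_domain, seed=0):
    """Oversample the in-domain data to the size of the general data, then mix."""
    rng = random.Random(seed)
    factor = len(general) // max(len(in_domain), 1)
    oversampled = in_domain * factor  # simple replication; sampling also works
    mixed = general + oversampled
    rng.shuffle(mixed)
    return mixed

general = [("src%d" % i, "tgt%d" % i) for i in range(1000)]
in_domain = [("med%d" % i, "med_tgt%d" % i) for i in range(100)]
mixed = mixed_finetune_corpus(general, in_domain)
print(len(mixed))  # 2000: 1000 general + 100 in-domain replicated 10x
```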


Pre-trained Models

Nowadays, you can find a lot of pre-trained models. Obviously, not all of them are of good quality, but you can try them out.

  • M2M-100 model supports 100 languages, including Arabic. You can find a CTranslate2 version of it that you can use in DesktopTranslator here.
  • Argos Translate models: Argos Translate is another good tool, and it also supports CTranslate2 models. You can download the model you want from the list of models, change the extension to zip, and extract it. Inside you will find the CTranslate2 model and the SentencePiece model, which you can use in DesktopTranslator as well.
  • Hugging Face models. However, most likely one should use them with the transformers library.

I hope this helps. If you have more questions, please let me know.

Kind regards,
Yasmin

You may also be interested in the latest CTranslate2 version which added a converter for the 1000+ pretrained models from OPUS-MT. See the “Marian” example in the quickstart.


This is great news. Thanks a lot, Guillaume! I see you also added support for mBART.

This is good and timely news for me. Thanks :slight_smile:

Thank you for your wonderful work!

I have a GPU and use your DesktopTranslator on Windows10. I want to use ctranslate2 with GPU, so I change your code as follows:

self.translator = ctranslate2.Translator(
    self.model_dir,
    device="gpu"
)

It doesn’t work.

Does CTranslate2 support GPU on Windows?

Thanks!

Dear Liu!

Please try device="cuda"; CTranslate2 accepts "cpu" or "cuda" as device names, not "gpu".

Kind regards,
Yasmin


Thank you, Yasmin!

I tried device="cuda", and the program didn't work and returned the following error:

Warning: load_model does not return WordVectorModel or SupervisedModel any more, but a FastText object which is very similar.
Exception in Tkinter callback
Traceback (most recent call last):
  File "D:\Python\Python38\lib\tkinter\__init__.py", line 1883, in __call__
    return self.func(*args)
  File "D:/kidden/mt/open/DesktopTranslator/translator.py", line 479, in translate_input
    translations_tok = self.translator.translate_batch(
RuntimeError: Library cublas64_11.dll is not found or cannot be loaded

Kind regards,
Liu Xiaofeng

Dear Liu,

Does the app work well with "cpu"? If so, could you please try to isolate the "cuda" issue first?

If you run the following code in Python, what do you get? Replace "ctranslate2_model" with the path to a CTranslate2 model. Please try the code once with device="cpu" and once with device="cuda".

import ctranslate2

translator = ctranslate2.Translator("ctranslate2_model", device="cuda")
batch = [["▁H", "ello", "▁world", "!"]]
translator.translate_batch(batch)

Kind regards,
Yasmin

Hi, Yasmin

Yes, I can run the app with "cpu".

The code runs with "cpu" and fails with "cuda". The error is as follows:

Traceback (most recent call last):
  File "D:/kidden/mt/open/mt-ex/temp/test_ct2.py", line 5, in <module>
    translator.translate_batch(batch)
RuntimeError: Library cublas64_11.dll is not found or cannot be loaded

I am running it on Windows 10 with a GPU. My GPU setup should be fine, because the CTranslate2 model was trained and converted on the same machine.

Can you run it on Windows with device="cuda"?

Thanks!

Thanks! Kindly check this issue. I am adding @guillaumekln for more insights.


The CUDA toolkit should be installed on the system in order to use the GPU:

Any CUDA version >= 11.2 should work.
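The RuntimeError above means the CUDA 11 runtime libraries are not visible to the process. A quick stdlib check for whether the system loader can locate a given library by name ("cublas64_11" is the DLL name from the error; this is a generic diagnostic, not part of DesktopTranslator):

```python
import ctypes.util

def loader_can_find(libname):
    """Return True if the dynamic loader can locate the library by name."""
    return ctypes.util.find_library(libname) is not None

# On Windows with CUDA >= 11.2 installed and on PATH, this should be True.
print(loader_can_find("cublas64_11"))
```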


Thank you, Yasmin and Guillaume!

Indeed, it was because of the CUDA version: I had installed CUDA 10.1 for my GPU.

By the way, this forum is so great and I learned a lot from it.


Do you happen to have the WMT19 EN-ZH and ZH-EN scores?

I am curious to see if the M2M models have the same issue as the NLLB200 on those CJK languages.

Thanks

Code update:

Currently, it should be out_type=str, i.e. the str type without quotes, or use sp.encode_as_pieces() instead.

The up-to-date version is always here:
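To make the note above concrete: out_type expects the Python str type itself, not the string "str". A minimal sketch (the sp argument stands for a loaded sentencepiece.SentencePieceProcessor; the wrapper name is mine):

```python
def encode_pieces(sp, text):
    """Tokenize text into subword pieces with a loaded SentencePiece processor."""
    # Correct: pass the type itself, not the string "str".
    return sp.encode(text, out_type=str)
    # Equivalent older-style call: sp.encode_as_pieces(text)
```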

Hi Vincent!

Here you go, English-to-Chinese results on the TICO-19 dataset:

M2M-100 1.2B:
spBLEU: 28.07
ChrF++: 36.38
TER: 101.31
COMET: 52.22

NLLB-200 1.2B:
spBLEU: 29.02
ChrF++: 37.45
TER: 110.22
COMET: 50.05

NLLB-200 3.3B:
spBLEU: 31.35
ChrF++: 39.08
TER: 109.52
COMET: 53.89

Kind regards,
Yasmin