NLLB-200 with CTranslate2

I have tried many ways; here is an example:

[['▁Dell', '▁Control', 'Va', 'ult', 'TM', '▁sur', '▁les', '▁ordin', 'ateurs', '▁por', 'tables', '▁Lat', 'itude', '▁-', '▁Sto', 'cke', '▁et', '▁tra', 'ite', '▁les', '▁informations', '▁d', "'", 'identi', 'fication', '▁et', '▁le', '▁code', '▁criti', 'que', '▁en', '▁dehors', '▁des', '▁ve', 'cteurs', '▁d', "’", 'atta', 'que', '▁habitu', 'els', '▁des', '▁logi', 'ci', 'els', '▁mal', 've', 'ill', 'ants', '.', '', 'fra_Latn'],
 ['▁Ê', 'tre', '▁capable', '▁de', '▁faire', '▁du', '▁cur', 'ling', '▁était', '▁un', '▁autre', '▁sport', '▁dans', '▁lequel', '▁je', '▁pou', 'vais', '▁être', '▁ac', 'tif', '▁et', '▁devenir', '▁bon', '.', '”', '', 'fra_Latn']]


['cas cas cas cas cas cas cas cas cas cas cas cas … (the token "cas" repeated for the entire first segment)', 'ARBARBARBARBARBARBARBARBARBARB… ("ARB" repeated, followed by a long run of " ⁇ " unknown-token symbols)']

In this example, I only add the language token at the end of each segment. I have also tried adding it at the beginning, and adding the </s> token as well.

I get the same random results with everything I have tried.

I have also tried this:

source_sents_subworded = [sent + ["</s>", src_lang] for sent in source_sents_subworded]

and this:

source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

I am making this call:

translations = model.translate_batch(
    source_sents_subworded, batch_type="tokens", max_batch_size=batch_size,
    beam_size=beam_size, target_prefix=target_prefix
)

where target_prefix is [['eng_Latn'], ['eng_Latn']]

I did not have problems with the CTranslate2 NLLB models that Guillaume shared (the ones that are not fine-tuned). The same setup does not seem to work with my converted fine-tuned version; I tried everything above but did not get the expected results.

You need version 3.11 of CTranslate2.
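
For example, you can check the installed version with:

import ctranslate2
print(ctranslate2.__version__)  # should be 3.11 or newer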


Yes, that was the problem. Thank you so much; it now works perfectly with this setup:

source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]
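
For reference, here is a minimal end-to-end sketch of the setup that works for me (the model paths, batch size, and sentence are just placeholders):

import ctranslate2
import sentencepiece as spm

src_lang, tgt_lang = "fra_Latn", "eng_Latn"

# Load the SentencePiece model and the converted CTranslate2 model
sp = spm.SentencePieceProcessor()
sp.load("flores200_sacrebleu_tokenizer_spm.model")
model = ctranslate2.Translator("ct2_model_dir", device="cpu")

source_sents = ["Être capable de faire du curling était un autre sport."]

# Subword the sentences, then add the source language token and </s>
source_sents_subworded = sp.encode(source_sents, out_type=str)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# The target language token goes in target_prefix, one entry per sentence
target_prefix = [[tgt_lang]] * len(source_sents)
translations = model.translate_batch(
    source_sents_subworded, batch_type="tokens", max_batch_size=2024,
    beam_size=4, target_prefix=target_prefix
)

# Strip the language token from each hypothesis before detokenizing
translations = [t.hypotheses[0][1:] for t in translations]
print(sp.decode(translations))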

How did you convert this model?

Example of converting an NLLB model to CTranslate2 with int8 quantization:

ct2-transformers-converter --model facebook/nllb-200-distilled-600M --quantization int8 --output_dir nllb-200-distilled-600M-int8
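
The resulting directory can then be loaded directly with CTranslate2 (device can be "cpu" or "cuda"):

import ctranslate2
translator = ctranslate2.Translator("nllb-200-distilled-600M-int8", device="cpu")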

@ymoslem How much quality degradation did you observe with the int8 quantization vs float32? Thanks.

Hello! I didn't test with NLLB, but with OPUS models the degradation is just a fraction of a point with CTranslate2. For example, for one of the OPUS English-Arabic models, the spBLEU was 29.32 without quantisation and 29.14 with quantisation. You can notice similarly small differences when converting OpenNMT-py or OpenNMT-tf models, too.
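
If you want to measure this yourself, spBLEU can be computed with sacreBLEU's SentencePiece-based tokenizer, along these lines (a sketch; the hypothesis and reference lists are placeholders, and tokenize="flores200" requires a recent sacreBLEU version):

import sacrebleu

hypotheses = ["..."]  # system outputs, one string per segment
references = ["..."]  # reference translations, aligned with the outputs

# "flores200" applies the FLORES-200 SentencePiece tokenization (spBLEU)
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="flores200")
print(bleu.score)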

I'm trying to run this code on an AWS SageMaker notebook:

import ctranslate2
import sentencepiece as spm


# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = "nllb-200-3.3B-int8"
sp_model_path = "flores200_sacrebleu_tokenizer_spm.model"

device = "cuda"  # or "cpu"

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

translator = ctranslate2.Translator(ct_model_path, device)

and I’m getting an error:

OSError: Not found: "flores200_sacrebleu_tokenizer_spm.model": No such file or directory Error #2

Any idea how to remedy this?

You can download this file from the first post of this thread.

Thanks for the quick response! Now when I run

ct_model_path = "nllb-200-600M-int8"
device = "cuda"
translator = ctranslate2.Translator(ct_model_path, device)

I get

RuntimeError: Unable to open file 'model.bin' in model 'nllb-200-600M-int8'

Did you download the NLLB model and place it in your current directory?

Yes, I downloaded it with
!wget "https://pretrained-nmt-models.s3.us-west-2.amazonaws.com/CTranslate2/nllb/nllb-200_600M_int8_ct2.zip"
Edit 1:
When I run os.listdir(), I see it downloaded as
'nllb-200_600M_int8_ct2.zip'
I tried
ct_model_path = "nllb-200_600M_int8_ct2"
and
ct_model_path = "nllb-200_600M_int8_ct2.zip"
and both give the same error.

Decompress the zip file first.
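
For example, from the notebook itself (assuming the zip is in the current directory; check the extracted folder name with os.listdir()):

import zipfile

# Extract the archive; the extracted folder should contain model.bin
with zipfile.ZipFile("nllb-200_600M_int8_ct2.zip") as zf:
    zf.extractall()

ct_model_path = "nllb-200_600M_int8_ct2"  # adjust to the extracted folder name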

Hi everyone,
I'm trying to translate a low-resource language like Uyghur.
As a beginner, I wonder if there are any websites where I can find the SOTA models and their performance.
Is NLLB suitable for low-resource translation tasks, or should I choose another model?
Looking forward to your suggestions.

Hello!

You can find the list of languages supported by NLLB here; they include Uyghur. On the same page, there is a “metrics” link where you can see some results. The 3-letter language code for Uyghur is uig.

Practically, you might need to fine-tune the model to achieve better performance.

I am not sure if you are looking for an open-source model specifically, but some commercial MT systems support Uyghur, too.

All the best,
Yasmin

Thanks for your reply.
Sorry for my unclear description.
I am looking for an open-source model that supports Uyghur translation.
Where can I find models that support Uyghur?
Can you give me some examples, or some URL links to those commercial MT systems?
Looking forward to your reply.

Please see the link I sent in the previous reply. On the Hugging Face website, you can search for models for specific languages.

As for commercial systems, the usual ones: Google Translate, Bing Translator, etc.

Best regards,
Yasmin

The model version on Hugging Face is not quantized, and I didn't need to pass the source language when using the transformers library. But in this case, I need to pass the source language. Is this caused by quantization?

Hello, I have two issues.
1. When I translate zh to en, the output contains many " ⁇ " tokens:
"你好" → " ⁇ Hi, how are you?"
"制作简历相关" → " ⁇ ️ ⁇ ️ ⁇ ️ ⁇ "
2. When I translate en to zh:
"hello,when i translate my text,i find there are many" → "zho_Hans hello hello hello hello … ("hello" repeated many times) when when when … ("when" repeated many times)"

My email is sun775618369@gmail.com. If you have a solution, I would be very grateful if you could email me or reply here. Thanks!

Hello! Could you please share the code you are using?