NLLB-200 with CTranslate2

NLLB-200 is a family of open-source pre-trained machine translation models covering 200 languages. They can be used via fairseq or Hugging Face Transformers. Recently, CTranslate2 introduced inference support for some Transformers models, including NLLB. This tutorial provides ready-to-use models in the CTranslate2 format, along with code examples for running these NLLB models with CTranslate2 and SentencePiece tokenization.

Download NLLB-200 models
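If you prefer to convert an original Hugging Face checkpoint yourself instead of downloading a ready-converted model, CTranslate2 ships a converter for Transformers models. The following is a minimal sketch; the checkpoint name facebook/nllb-200-distilled-600M, the output directory, and the int8 quantization are only examples, and the conversion requires the transformers and torch packages to be installed.

from ctranslate2.converters import TransformersConverter

# Convert a Hugging Face NLLB checkpoint to the CTranslate2 format
# (the ct2-transformers-converter command-line tool does the same job)
converter = TransformersConverter("facebook/nllb-200-distilled-600M")
converter.convert("nllb-200-distilled-600M-int8", quantization="int8")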

Load the model and tokenizer

import ctranslate2
import sentencepiece as spm


# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = "nllb-200-3.3B-int8"
sp_model_path = "flores200_sacrebleu_tokenizer_spm.model"

device = "cuda"  # or "cpu"

# Load the source SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Load the CTranslate2 translation model
translator = ctranslate2.Translator(ct_model_path, device=device)

Translate a list of sentences

source_sentences = ["Ntabwo ntekereza ko iyi modoka ishaje izagera hejuru yumusozi.",
                    "Kanda iyi buto hanyuma umuryango ukingure",
                    "Ngendahimana yashakaga ikaramu"
                   ]

# Source and target language codes
src_lang = "kin_Latn"
tgt_lang = "eng_Latn"

beam_size = 4

source_sentences = [sent.strip() for sent in source_sentences]
target_prefix = [[tgt_lang]] * len(source_sentences)

# Subword the source sentences
source_sents_subworded = sp.encode_as_pieces(source_sentences)
source_sents_subworded = [sent + ["</s>", src_lang] for sent in source_sents_subworded]
print("First subworded source sentence:", source_sents_subworded[0], sep="\n")

# Translate the source sentences
translations_subworded = translator.translate_batch(source_sents_subworded,
                                                    batch_type="tokens",
                                                    max_batch_size=2024,
                                                    beam_size=beam_size,
                                                    target_prefix=target_prefix)
translations_subworded = [translation.hypotheses[0] for translation in translations_subworded]
for translation in translations_subworded:
  if tgt_lang in translation:
    translation.remove(tgt_lang)

# Desubword the target sentences
translations = sp.decode(translations_subworded)


print("First sentence and translation:", source_sentences[0], translations[0], sep="\n• ")

Output:

Translations:
• I don’t think this old car will make it to the top of the hill.
• Click this button and the door will open.
• Ngendahimana was looking for a pen.
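
For convenience, the steps above can be wrapped into a single helper function. This is a minimal sketch that reuses the translator and sp objects loaded earlier; the function name and its defaults are only illustrative.

def translate_nllb(sentences, src_lang, tgt_lang, beam_size=4):
    """Translate a list of raw sentences with the loaded CTranslate2 NLLB model."""
    sentences = [sent.strip() for sent in sentences]

    # Subword the source sentences and append the </s> and source language tokens
    tokenized = sp.encode_as_pieces(sentences)
    tokenized = [tokens + ["</s>", src_lang] for tokens in tokenized]

    # The target language token is passed as a target prefix
    target_prefix = [[tgt_lang]] * len(sentences)

    results = translator.translate_batch(tokenized,
                                         batch_type="tokens",
                                         max_batch_size=2024,
                                         beam_size=beam_size,
                                         target_prefix=target_prefix)

    # Keep the best hypothesis and drop the leading target language token
    outputs = [result.hypotheses[0] for result in results]
    outputs = [tokens[1:] if tokens[:1] == [tgt_lang] else tokens for tokens in outputs]
    return sp.decode(outputs)


print(translate_nllb(["Ngendahimana yashakaga ikaramu"], "kin_Latn", "eng_Latn"))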


Notebook

You can also use this Google Colab notebook.

Licence of models

CC-BY-NC

Note on language token placement

I just realized that for NLLB, the source language token should come after the source sentence (not before it, as in M2M-100). There is also a </s> token before the source language token. Hence, for the flores200 subword model to work well with SentencePiece, these two tokens must be appended to the source sentence tokens. This step is a must, and it dramatically affects the quality of the translation.

source_sents_subworded = [sent + ["</s>", src_lang] for sent in source_sents_subworded]

Interestingly, the target also starts with the two tokens ["</s>", tgt_lang]. However, as far as I can see in the results, passing only [tgt_lang] as the target prefix is enough.
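
To make the expected layout concrete, here is a minimal sketch of the token sequences NLLB expects, reusing the sp processor loaded earlier (the sentence and language codes are only examples):

# NLLB expects the language token AFTER the source subwords, preceded by </s>:
#   [<source subwords>, "</s>", src_lang]
# (M2M-100, by contrast, puts the language token before the sentence.)
pieces = sp.encode_as_pieces("Ngendahimana yashakaga ikaramu")
source_tokens = pieces + ["</s>", "kin_Latn"]

# On the target side, passing only the target language token as a prefix is enough
target_prefix = ["eng_Latn"]

print(source_tokens)
print(target_prefix)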