NLLB-200 is a family of open-source pre-trained machine translation models covering 200 languages. The models can be used via FairSeq or Hugging Face Transformers. Recently, CTranslate2 introduced inference support for some Transformers models, including NLLB. This tutorial provides ready-to-use NLLB models in the CTranslate2 format, along with code examples for running them in CTranslate2 with SentencePiece tokenization.
- NLLB 600M - CTranslate2 int8
- NLLB 1.3B - CTranslate2 int8
- NLLB 3.3B - CTranslate2 int8
- SentencePiece model - 200 languages
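If you would rather convert a checkpoint yourself than download one of the files listed here, CTranslate2 ships a `ct2-transformers-converter` command for Hugging Face models. The sketch below only assembles that command. The model ID `facebook/nllb-200-distilled-600M` and the output directory name are assumptions chosen for illustration, and running the conversion additionally requires `transformers` and `torch` to be installed.

```python
import shlex


def build_convert_command(model_name, output_dir, quantization="int8"):
    """Return the ct2-transformers-converter argument list for one model."""
    return [
        "ct2-transformers-converter",
        "--model", model_name,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]


# Assumed model ID and output directory; adjust both to your setup.
cmd = build_convert_command("facebook/nllb-200-distilled-600M", "nllb-200-600M-int8")
print(shlex.join(cmd))

# To actually run the conversion (this downloads the model):
# import subprocess
# subprocess.run(cmd, check=True)
```

The resulting directory can then be passed as the CTranslate2 model path in place of a downloaded model.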
```python
import ctranslate2
import sentencepiece as spm

# [Modify] Set paths to the CTranslate2 and SentencePiece models
ct_model_path = "nllb-200-3.3B-int8"
sp_model_path = "flores200_sacrebleu_tokenizer_spm.model"

device = "cuda"  # or "cpu"

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)

# Load the CTranslate2 model
translator = ctranslate2.Translator(ct_model_path, device=device)
```
```python
source_sentences = [
    "Ntabwo ntekereza ko iyi modoka ishaje izagera hejuru yumusozi.",
    "Kanda iyi buto hanyuma umuryango ukingure",
    "Ngendahimana yashakaga ikaramu",
]

# Source and target language codes
src_lang = "kin_Latn"
tgt_lang = "eng_Latn"

beam_size = 4

source_sentences = [sent.strip() for sent in source_sentences]
target_prefix = [[tgt_lang]] * len(source_sentences)

# Subword the source sentences
source_sents_subworded = sp.encode_as_pieces(source_sentences)
# NLLB expects the source language tag first and "</s>" last
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]
print("First subworded source sentence:", source_sents_subworded[0], sep="\n")

# Translate the source sentences with the translator loaded earlier
results = translator.translate_batch(
    source_sents_subworded,
    batch_type="tokens",
    max_batch_size=2024,
    beam_size=beam_size,
    target_prefix=target_prefix,
)
# Keep the best hypothesis of each result
translations_subworded = [result.hypotheses[0] for result in results]

# Remove the target language tag from each translation
for translation in translations_subworded:
    if tgt_lang in translation:
        translation.remove(tgt_lang)

# Desubword the target sentences
translations = sp.decode(translations_subworded)

print("Translations:", *translations, sep="\n• ")
```
• I don’t think this old car will make it to the top of the hill.
• Click this button and the door will open.
• Ngendahimana was looking for a pen.
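The `src_lang` and `tgt_lang` values follow the FLORES-200 convention: a three-letter language code, an underscore, and a four-letter script code, as in `kin_Latn` and `eng_Latn`. A mistyped code may be treated as an unknown token instead of raising an error, so a quick shape check before translating can help. The helper below is a hypothetical convenience, not part of CTranslate2 or SentencePiece, and it validates the format only, not whether the language is among those the model supports.

```python
import re

# FLORES-200 style: three lowercase letters, underscore, capitalized four-letter script
_LANG_CODE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_flores_code(code: str) -> bool:
    """Return True if `code` has the shape of a FLORES-200 language code.

    Format check only: it does not confirm that the model knows the language.
    """
    return _LANG_CODE.fullmatch(code) is not None


for code in ("kin_Latn", "eng_Latn", "eng_latn", "en"):
    print(code, looks_like_flores_code(code))
```

For a stricter check, one could instead test whether the code is present in the model's vocabulary.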
You can also use this Google Colab notebook.
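A note on the translation call above: `batch_type="tokens"` with `max_batch_size=2024` asks CTranslate2 to cap each batch by token count instead of by number of sentences, which keeps memory use more predictable when sentence lengths vary. The sketch below illustrates the idea in plain Python. It is a simplified illustration under the assumption of greedy, longest-first packing, and CTranslate2's internal batching may differ. The function name `batch_by_tokens` is hypothetical.

```python
def batch_by_tokens(sentences, max_batch_size):
    """Group tokenized sentences so each batch holds at most max_batch_size tokens.

    Returns a list of batches, each a list of indices into `sentences`.
    A single sentence longer than the limit still gets a batch of its own.
    """
    # Visit sentences longest first so similar lengths end up together
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]), reverse=True)
    batches, current, current_tokens = [], [], 0
    for i in order:
        n = len(sentences[i])
        # Start a new batch when adding this sentence would exceed the limit
        if current and current_tokens + n > max_batch_size:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(i)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


# Toy input: four "sentences" of 5, 3, 4, and 2 tokens
toy = [["a"] * 5, ["b"] * 3, ["c"] * 4, ["d"] * 2]
print(batch_by_tokens(toy, max_batch_size=8))  # [[0], [2, 1], [3]]
```

With a limit of 8 tokens, the 5-token sentence travels alone, the 4-token and 3-token sentences share a batch, and the 2-token sentence goes last.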