NLLB-200 is a family of open-source pre-trained machine translation models. They can be used via FairSeq or Hugging Face Transformers. Recently, CTranslate2 introduced inference support for some Transformers models, including NLLB. This tutorial aims to provide ready-to-use models in the CTranslate2 format, along with code examples for running these NLLB models in CTranslate2 with SentencePiece tokenization.
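To get started, here is a minimal conversion sketch using the CTranslate2 Python converter API; the model name, output directory, and quantization choice are illustrative, and the transformers package must be installed:

```python
from ctranslate2.converters import TransformersConverter

# Download the Hugging Face checkpoint and convert it to the CTranslate2
# format with int8 quantization (model name and output path are examples).
converter = TransformersConverter("facebook/nllb-200-distilled-600M")
converter.convert("nllb-200-distilled-600M-ct2", quantization="int8")
```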
I just realized that for NLLB, the source language token should come after the source sentence (not before it, as in M2M-100). There is also a </s> token before the source language token. Hence, for the FLORES-200 subword model to work well with SentencePiece, these two tokens must be appended to the source sentence tokens. This step is essential, and it dramatically affects translation quality.
source_sents_subworded = [sent + ["</s>", src_lang] for sent in source_sents_subworded]
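For context, here is a minimal sketch of the full subwording step that leads up to the line above, assuming the FLORES-200 SentencePiece model has been downloaded locally (the file name below is an assumption):

```python
import sentencepiece as spm

# Load the FLORES-200 SentencePiece model (file name is an assumption)
sp = spm.SentencePieceProcessor()
sp.load("flores200_sacrebleu_tokenizer_spm.model")

src_lang = "eng_Latn"
source_sents = ["This is an example sentence."]

# Split each source sentence into subword pieces
source_sents_subworded = [sp.encode_as_pieces(sent) for sent in source_sents]

# Append the </s> token and the source language token, as discussed above
source_sents_subworded = [sent + ["</s>", src_lang] for sent in source_sents_subworded]
```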
Interestingly, even the target starts with the two tokens ["</s>", tgt_lang]. However, as far as I can see in the results, adding [tgt_lang] alone as a target prefix is enough.
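Continuing the sketch above, translation with CTranslate2 then only needs [tgt_lang] as the target prefix; the model directory, device, and beam size below are assumptions:

```python
import ctranslate2

# Load the converted model (directory name and device are assumptions)
translator = ctranslate2.Translator("nllb-200-distilled-600M-ct2", device="cpu")

tgt_lang = "fra_Latn"
target_prefix = [[tgt_lang]] * len(source_sents_subworded)

results = translator.translate_batch(source_sents_subworded,
                                     target_prefix=target_prefix,
                                     beam_size=4)

# Each hypothesis starts with the target prefix; drop the language token
# before detokenizing with SentencePiece.
translations_subworded = [result.hypotheses[0] for result in results]
translations_subworded = [tokens[1:] if tokens and tokens[0] == tgt_lang else tokens
                          for tokens in translations_subworded]
translations = [sp.decode_pieces(tokens) for tokens in translations_subworded]
print(translations)
```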
Evaluation results on the TICO-19 dataset (3,070 segments) for English-to-Arabic, English-to-French, and English-to-Kinyarwanda language pairs. NLLB and OPUS models were converted to the CTranslate2 format with int8 quantization.
If you manage to fine-tune something, I’d be interested to see the results.
Bear in mind that these models use a vocabulary of more than 256K tokens, so you may be tempted to reduce its size, especially if you don’t use some of the scripts it covers.
@vince62s I’m new to all of this and was originally planning to use a number of OPUS-MT language-pair models for ad hoc translation in a social network platform I’m building. But NLLB looks like quite a promising way to simplify it all. So, why do you say that NLLB is useless? Is it because it is much slower than OPUS-MT? Do you have a specific recommendation? Thanks!
Thanks! Yeah, for the most part I’ll be running the top languages rather than every possible language.
I’m not looking to do any training myself - I just want something pretrained that has a good balance of quality and speed of translation. Required server memory is also a factor - I’d like to have the model(s) preloaded to eliminate cold start time. I figured NLLB would suit these needs well, as compared to keeping 10-50 OPUS models preloaded.
So, when you say NLLB is “not very good” for top 10 languages, what do you mean? The tables at the top of this thread seem to show NLLB performing fairly similarly to OPUS and Google for some major language pairs…
Hi, I would like to replicate this experiment on my hardware and try some kind of domain adaptation with this model. If I can achieve some improvement with the 600M model, I will try to repeat it with the bigger models.
How do you fine-tune NLLB? Is there anything I should take into account before training, or is it straightforward?