No Language Left Unlocked dataset

In 2022 Meta released NLLB, a set of multilingual machine translation models with impressive performance. However, the model weights were released under a restrictive non-commercial license, making them unusable for most open-source projects. The models also suffer from a limited dictionary, which causes many translations to contain unknown tokens.

This repository contains the software to run NLLU, an effort to run NLLB inference at scale to generate a corpus of bitext data that can be used to train new, permissively licensed language models.

Running NLLB inference on millions of sentences is computationally intensive and would take years on a single machine. We designed a simple server architecture that distributes batches of sentences to be translated asynchronously across machines, which can be rented cheaply with providers such as or
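
To make the idea of handing out batches asynchronously more concrete, here is a minimal in-memory sketch of a batch coordinator. It is not the code in this repository: the `BatchQueue` class, its method names, and the lease timeout are all hypothetical, and the network layer and the actual NLLB inference call are left out. It only illustrates one plausible approach, where a worker checks out a batch, and a batch whose worker disappears is handed out again once its lease expires.

```python
import time
from collections import deque


class BatchQueue:
    """Hypothetical coordinator: hands out sentence batches and collects translations."""

    def __init__(self, lease_seconds=600.0):
        self.lease_seconds = lease_seconds  # assumed timeout before a batch is re-issued
        self._pending = deque()             # (batch_id, sentences) waiting for a worker
        self._leased = {}                   # batch_id -> (deadline, sentences)
        self._done = {}                     # batch_id -> list of translations
        self._next_id = 0

    def add_sentences(self, sentences, batch_size):
        """Split the sentences into fixed-size batches and queue them."""
        for start in range(0, len(sentences), batch_size):
            batch = list(sentences[start:start + batch_size])
            self._pending.append((self._next_id, batch))
            self._next_id += 1

    def checkout(self, now=None):
        """Give the caller the next batch, or None if nothing is available."""
        now = time.monotonic() if now is None else now
        # Re-queue batches whose worker did not report back in time.
        for batch_id, (deadline, sentences) in list(self._leased.items()):
            if deadline <= now:
                del self._leased[batch_id]
                self._pending.append((batch_id, sentences))
        while self._pending:
            batch_id, sentences = self._pending.popleft()
            if batch_id in self._done:
                continue  # a late worker already delivered this batch
            self._leased[batch_id] = (now + self.lease_seconds, sentences)
            return batch_id, sentences
        return None

    def submit(self, batch_id, translations):
        """Store a worker's result; return False for unknown or duplicate batches."""
        if batch_id in self._done or not 0 <= batch_id < self._next_id:
            return False
        self._leased.pop(batch_id, None)
        self._done[batch_id] = list(translations)
        return True

    def is_finished(self):
        return len(self._done) == self._next_id

    def results(self):
        return dict(self._done)


if __name__ == "__main__":
    queue = BatchQueue(lease_seconds=60)
    queue.add_sentences(["hello", "good morning", "thank you"], batch_size=2)
    # Stand-in for a worker loop; str.upper replaces the real translation call.
    while (job := queue.checkout()) is not None:
        batch_id, sentences = job
        queue.submit(batch_id, [s.upper() for s in sentences])
    print(queue.results())
```

In a real deployment the `checkout` and `submit` calls would sit behind some network interface so that rented machines can poll for work independently. The lease mechanism is one way to tolerate workers that are shut down mid-batch, at the cost of occasionally translating a batch twice.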