MADLAD-400: A Multilingual And Document-Level Large Audited Dataset + Model

Paper: [2309.04662] MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Models (3B, 7.2B, 10B + 8B LM): https://github.com/google-research/google-research/tree/master/madlad_400

Looking forward to independent testing of this! Their paper does benchmark against NLLB. I can't figure out the license terms for the MADLAD models. It seems to be ODC-BY (the same as the dataset), which allows commercial use.

Please correct me if I’m wrong about anything.

The dataset is out: allenai/MADLAD-400 on Hugging Face