Paper: [2309.04662] MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
Models (3B, 7.2B, 10B + 8B LM): https://github.com/google-research/google-research/tree/master/madlad_400
Looking forward to independent testing of this! The paper does benchmark against NLLB. I can’t figure out the license terms for the MADLAD models. They seem to be under ODC-BY (just like the dataset), which allows commercial use.
Please correct me if I’m wrong about anything.