It seems there’s some interest in MADLAD-400 models. Well, good news: it’s a T5 model, so it works out-of-the-box with ctranslate2 (or it should).
The first problem was that the tokenizer provided by jbochi on huggingface had slight issues with the vocab size. That has now been fixed. He also updated the other models after my pull request for the 3b model.
Now, ct2-transformers-converter can be used to convert MADLAD-400 models to ctranslate2 models, but you WILL encounter gibberish output. The fix has been merged into the main repo but has not been released yet. The quick fix is to change "decoder_start_token": "<s>" to "decoder_start_token": "<unk>" in the config.json file created by the conversion.
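If you'd rather not edit the file by hand, something like the following should do the same thing. It is only a rough sketch: the helper name patch_decoder_start_token is made up for illustration, and it assumes config.json sits at the top level of the converted model directory and is plain JSON with a "decoder_start_token" key, as described above.

```python
import json
from pathlib import Path


def patch_decoder_start_token(model_dir, token="<unk>"):
    """Set "decoder_start_token" in a converted model's config.json.

    Hypothetical helper, not part of ctranslate2. Returns the previous
    value of the key (None if the key was absent).
    """
    # Assumption: the converter wrote config.json directly into model_dir.
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))

    previous = config.get("decoder_start_token")
    config["decoder_start_token"] = token

    # Rewrite the file, leaving every other key untouched.
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return previous
```

Call it with the output directory you passed to the converter, e.g. patch_decoder_start_token("madlad400-3b-ct2") (that directory name is just a placeholder). It should only be needed on ctranslate2 versions that don't include the fix yet.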
Hope this helps.
CT2 3.22 has been released and includes this fix.