CTranslate2 Supports MADLAD-400

It seems there’s some interest in MADLAD-400 models. Well, good news: it’s a T5 model and it works out of the box with CTranslate2 (or it should).
The first problem was that the tokenizer provided by jbochi on Hugging Face had a slight issue with the vocabulary size. It has now been fixed, and he also updated the other models after my pull request for the 3B model.
Now ct2-transformers-converter can be used to convert MADLAD-400 models to CTranslate2 models, but you WILL encounter gibberish output. The fix has been merged into the main repo, but it has not been released yet. The quick fix is to change "decoder_start_token": "<s>" to "decoder_start_token": "<unk>" in the config.json file created by the conversion.
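Until the fixed release is out, the edit can be done by hand or scripted. Here is a minimal sketch that patches the converted model's config.json; the directory name is just an example matching the conversion commands below:

import json

# Path to the directory produced by ct2-transformers-converter
# (the directory name here is only an example).
config_path = "ct2-madlad400-3b-mt-int8/config.json"

with open(config_path) as f:
    config = json.load(f)

# Work around the unreleased fix: MADLAD-400 expects <unk>,
# not <s>, as the decoder start token.
config["decoder_start_token"] = "<unk>"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)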
Hope this helps. :wink:

CTranslate2 3.22 has been released and includes this fix, so models converted with 3.22 or later should no longer need the manual config.json edit.

Code as recommended in GitHub issue 1560

Convert the model

ct2-transformers-converter --model google/madlad400-3b-mt --quantization int8 --output_dir ct2-madlad400-3b-mt-int8
ct2-transformers-converter --model google/madlad400-7b-mt --quantization int8 --output_dir ct2-madlad400-7b-mt-int8
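The same conversion can also be run from Python through the converter API. A minimal sketch for the 3B model, assuming the same example output directory as above:

from ctranslate2.converters import TransformersConverter

# Equivalent of the first CLI call above: convert the 3B model
# with int8 quantization into a CTranslate2 model directory.
converter = TransformersConverter("google/madlad400-3b-mt")
converter.convert("ct2-madlad400-3b-mt-int8", quantization="int8")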

Translation

import ctranslate2
import transformers

ct2_model_path = "ct2-madlad400-3b-mt-int8"  # or "ct2-madlad400-7b-mt-int8"
device = "cuda"  # or "cpu"
translator = ctranslate2.Translator(ct2_model_path, device)

# The converted directory has no tokenizer files, so load the original one.
tokenizer = transformers.AutoTokenizer.from_pretrained("jbochi/madlad400-3b-mt")

# MADLAD-400 selects the target language with a <2xx> tag
# prepended to the source text.
tgt_code = "<2en>"
text = "大家好"

input_text = tgt_code + text
# CTranslate2 works on string tokens, not token ids.
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens])
output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

print(output_text)

Output

Hello, everyone.
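Since translate_batch takes a list of token sequences, several sentences can be translated in one call. A minimal sketch reusing the translator and tokenizer above; the <2de> (German) tag is an assumption following the same <2xx> convention as <2en>:

# Translate a small batch to German, reusing `translator` and
# `tokenizer` from the example above.
texts = ["大家好", "谢谢"]
batch = [
    tokenizer.convert_ids_to_tokens(tokenizer.encode("<2de>" + t))
    for t in texts
]
results = translator.translate_batch(batch)
for result in results:
    tokens = result.hypotheses[0]
    print(tokenizer.decode(tokenizer.convert_tokens_to_ids(tokens)))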