It seems there’s some interest in the MADLAD-400 models. Good news: MADLAD-400 is a T5 model, so it works out of the box with CTranslate2 (or it should).
The first problem was that the tokenizer provided by jbochi on Hugging Face had a slight issue with the vocabulary size. It has now been fixed, and he also updated the other models after my pull request for the 3b model.
Now, ct2-transformers-converter can be used to convert MADLAD-400 models to CTranslate2 models. But you WILL encounter gibberish output: the fix has been merged into the main repo but has not been released yet. The quick fix is to change "decoder_start_token": "<s>" to "decoder_start_token": "<unk>" in the config.json file created after converting, as shown in the sketch below.
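A minimal sketch of that edit, assuming the converted model sits in the --output_dir used during conversion (the path here is an assumption; adjust it to yours):
import json

# Hypothetical path: the --output_dir passed to ct2-transformers-converter.
config_path = "ct2-madlad400-3b-mt-int8/config.json"

with open(config_path) as f:
    config = json.load(f)

# Replace the decoder start token that causes the gibberish output.
config["decoder_start_token"] = "<unk>"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)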
Hope this helps.
CTranslate2 3.22 has been released and includes this fix.
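If you want to confirm your install already has the fix, a quick sanity check is to print the package version:
import ctranslate2

# The decoder_start_token fix ships with CTranslate2 3.22.
print(ctranslate2.__version__)  # expect 3.22.0 or later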
Code as recommended in GitHub issue 1560:
Convert the model
ct2-transformers-converter --model google/madlad400-3b-mt --quantization int8 --output_dir ct2-madlad400-3b-mt-int8
ct2-transformers-converter --model google/madlad400-7b-mt --quantization int8 --output_dir ct2-madlad400-7b-mt-int8
Translation
import ctranslate2
import transformers

ct2_model_path = "ct2-madlad400-3b-mt-int8"  # or "ct2-madlad400-7b-mt-int8"
device = "cuda"  # or "cpu"

translator = ctranslate2.Translator(ct2_model_path, device)
tokenizer = transformers.AutoTokenizer.from_pretrained("jbochi/madlad400-3b-mt")

# MADLAD-400 selects the target language with a tag prefixed to the source text.
tgt_code = "<2en>"
text = "大家好"
input_text = tgt_code + text

# CTranslate2 expects tokens as strings, not ids.
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens])
output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))
print(output_text)
Output
Hello, everyone.
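Since the target language is just a tag prefixed to the source text, the same translator can be reused for other targets by swapping the tag. A small sketch, assuming the <2fr>, <2de>, and <2es> tags from the MADLAD-400 model card:
# Translate the same source into several targets by changing the tag.
for tag in ("<2fr>", "<2de>", "<2es>"):
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(tag + text))
    hyp = translator.translate_batch([tokens])[0].hypotheses[0]
    print(tag, tokenizer.decode(tokenizer.convert_tokens_to_ids(hyp)))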