
Convert M2M model to CTranslate2

@guillaumekln Thanks for the great ctranslate2 library.

With this release, which supports conversion of Transformer models trained with Fairseq, is it possible to convert Facebook AI's M2M100_418M model too? I can't seem to find straightforward examples of similar models being converted to CTranslate2 so far. The original model is in the Fairseq repository (https://github.com/pytorch/fairseq/tree/master/examples/m2m_100), and there is a Hugging Face Transformers version as well (https://huggingface.co/facebook/m2m100_418M).

I was able to convert the WMT16 model successfully, but M2M100 seems to have quite a different model structure.

Here’s the conversion script I used:

import os
import ctranslate2

data_dir = os.path.join(
    "path",
    "to",
    "wmt16.en-de.joined-dict.transformer"
)

converter = ctranslate2.converters.FairseqConverter(
    os.path.join(data_dir, "model.pt"), data_dir
)
output_dir = os.path.join(data_dir, "ctranslate2_model")

converter.convert(output_dir)

Run from the cloned CTranslate2 repo with:

python3 python/wmt16_converter.py
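The packaged CLI entry point should perform the same conversion (a sketch, assuming ctranslate2 was installed with pip):

ct2-fairseq-converter \
    --model_path path/to/wmt16.en-de.joined-dict.transformer/model.pt \
    --data_dir path/to/wmt16.en-de.joined-dict.transformer \
    --output_dir path/to/wmt16.en-de.joined-dict.transformer/ctranslate2_model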

Many thanks for your help!

Did you try to convert the M2M model? If yes, what errors are reported?

Hi,
This is the script I tried with the Hugging Face M2M100 model:

import os
import ctranslate2

# relative path to where the script is run from
data_dir = os.path.join(
    "path...",
    "m2m100_418M"
)

# huggingface transformer m2m100 model
# Ref: https://huggingface.co/facebook/m2m100_418M
converter = ctranslate2.converters.FairseqConverter(
    os.path.join(data_dir, "pytorch_model.bin"), data_dir
)
output_dir = "/path/m2m_100/ctranslate2_model"
converter.convert(output_dir)

This is the error I got:

python3 python/m2m_100_converter.py

Traceback (most recent call last):
  File "python/m2m_100_converter.py", line 23, in <module>
    converter.convert(output_dir)
  File "/<path>/github.com/OpenNMT/CTranslate2/python/ctranslate2/converters/converter.py", line 45, in convert
    model_spec = self._load()
  File "/<path>/github.com/OpenNMT/CTranslate2/python/ctranslate2/converters/fairseq.py", line 84, in _load
    checkpoint = checkpoint_utils.load_checkpoint_to_cpu(self._model_path)
  File "<path>/Library/Python/3.8/lib/python/site-packages/fairseq/checkpoint_utils.py", line 228, in load_checkpoint_to_cpu
    args = state["args"]
KeyError: 'args'

I also tried the original Fairseq M2M100_418M model (https://github.com/pytorch/fairseq/tree/master/examples/m2m_100) and got a different error.

Script:

import os
import ctranslate2

# relative path to where the script is run from
data_dir = os.path.join(
    "path...",
    "m2m100_original"
)

# original fairseq m2m100 model
# Ref: https://github.com/pytorch/fairseq/tree/master/examples/m2m_100
converter = ctranslate2.converters.FairseqConverter(
    os.path.join(data_dir, "418M_last_checkpoint.pt"), data_dir
)
output_dir = "/path/m2m100_original/ctranslate2_model"
converter.convert(output_dir)

Error:

python3 python/m2m_100_original_converter.py

External language dictionary is not provided; use lang-pairs to infer the set of supported languages. The language ordering is not stable which might cause misalignment in pretraining and finetuning.
Traceback (most recent call last):
  File "python/m2m_100_original_converter.py", line 23, in <module>
    converter.convert(output_dir)
  File "<path>/OpenNMT/CTranslate2/python/ctranslate2/converters/converter.py", line 45, in convert
    model_spec = self._load()
  File "<path>/OpenNMT/CTranslate2/python/ctranslate2/converters/fairseq.py", line 92, in _load
    task = fairseq.tasks.setup_task(args)
  File "<path>/Python/3.8/lib/python/site-packages/fairseq/tasks/__init__.py", line 28, in setup_task
    return TASK_REGISTRY[task_cfg.task].setup_task(task_cfg, **kwargs)
  File "<path>/Python/3.8/lib/python/site-packages/fairseq/tasks/translation_multi_simple_epoch.py", line 106, in setup_task
    langs, dicts, training = MultilingualDatasetManager.prepare(
  File "<path>/Python/3.8/lib/python/site-packages/fairseq/data/multilingual/multilingual_data_manager.py", line 371, in prepare
    dicts[lang] = load_dictionary(
  File "<path>/Python/3.8/lib/python/site-packages/fairseq/tasks/fairseq_task.py", line 54, in load_dictionary
    return Dictionary.load(filename)
  File "<path>/Python/3.8/lib/python/site-packages/fairseq/data/dictionary.py", line 214, in load
    d.add_from_file(f)
  File "<path>/Python/3.8/lib/python/site-packages/fairseq/data/dictionary.py", line 227, in add_from_file
    raise fnfe
  File "<path>/Python/3.8/lib/python/site-packages/fairseq/data/dictionary.py", line 224, in add_from_file
    with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
FileNotFoundError: [Errno 2] No such file or directory: '<path>/m2m100_original/dict.af.txt'

We also tried to convert M2M-100 1.2B with ct2-fairseq-converter

https://dl.fbaipublicfiles.com/m2m_100/1.2B_last_checkpoint.pt

and got this error:

✗ aws-fb-test ~ $ ct2-fairseq-converter --model_path /root/fairseq/1.2B_last_checkpoint.pt --data_dir /root/fairseq/ --output_dir /tmp/out --force
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.1/bin/ct2-fairseq-converter", line 8, in <module>
    sys.exit(main())
  File "/root/.pyenv/versions/3.8.1/lib/python3.8/site-packages/ctranslate2/bin/fairseq_converter.py", line 18, in main
    converters.FairseqConverter(args.model_path, args.data_dir).convert_from_args(args)
  File "/root/.pyenv/versions/3.8.1/lib/python3.8/site-packages/ctranslate2/converters/converter.py", line 31, in convert_from_args
    return self.convert(
  File "/root/.pyenv/versions/3.8.1/lib/python3.8/site-packages/ctranslate2/converters/converter.py", line 45, in convert
    model_spec = self._load()
  File "/root/.pyenv/versions/3.8.1/lib/python3.8/site-packages/ctranslate2/converters/fairseq.py", line 92, in _load
    task = fairseq.tasks.setup_task(args)
  File "/root/fairseq/fairseq/tasks/__init__.py", line 44, in setup_task
    return task.setup_task(cfg, **kwargs)
  File "/root/fairseq/fairseq/tasks/translation_multi_simple_epoch.py", line 125, in setup_task
    langs, dicts, training = MultilingualDatasetManager.prepare(
  File "/root/fairseq/fairseq/data/multilingual/multilingual_data_manager.py", line 311, in prepare
    if args.langtoks is None:
AttributeError: 'Namespace' object has no attribute 'langtoks'

I updated the converter to support M2M models:

  • For conversion, pass the path to the single vocabulary file via the fixed_dictionary option (same name as the Fairseq option); see the conversion sketch after the translation example below.
  • For translation, include the language tags in the input like this:
translator.translate_batch(
    [["__en__", "▁Hello", "▁World", "!"], ["__en__", "▁Hello", "▁World", "!"]],
    target_prefix=[["__fr__"], ["__es__"]],
)

This example translates the same English sentence into French and Spanish.
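For the conversion itself, a minimal CLI sketch (the vocabulary file name model_dict.128k.txt is an assumption based on the file shipped in the Fairseq M2M-100 release; note that the original Fairseq checkpoint must be used, since the Hugging Face pytorch_model.bin is a plain state dict without the Fairseq training args, which is what caused the KeyError: 'args' above):

ct2-fairseq-converter \
    --model_path m2m100_original/418M_last_checkpoint.pt \
    --data_dir m2m100_original \
    --fixed_dictionary m2m100_original/model_dict.128k.txt \
    --output_dir m2m100_original/ctranslate2_model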

I tested the 418M model and it seems to work fine, but if you can verify on your side that would be great!
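For reference, an end-to-end sketch of what such a test could look like (the SentencePiece model name spm.128k.model is taken from the Fairseq M2M-100 release, and the result access assumes a recent CTranslate2 Python API):

import ctranslate2
import sentencepiece as spm

# Assumed file names from the Fairseq M2M-100 release; adjust paths as needed.
sp = spm.SentencePieceProcessor(model_file="spm.128k.model")
translator = ctranslate2.Translator("m2m100_original/ctranslate2_model")

# Prefix the tokenized source with its language tag, as described above.
source = ["__en__"] + sp.encode("Hello World!", out_type=str)
results = translator.translate_batch([source], target_prefix=[["__fr__"]])

# Drop the leading target language tag before detokenizing.
print(sp.decode(results[0].hypotheses[0][1:]))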

The 1.2B model converted fine.

But the biggest one, the 12B model (12b_last_chk_4_gpus.pt), gives a conversion error. Is this because it is sharded across several GPUs?

Traceback (most recent call last):
  File "/root/.local/share/virtualenvs/r10-MnTosGMW/bin/my-convert", line 13, in <module>
    converter.convert(output_dir)
  File "/root/.local/share/virtualenvs/r10-MnTosGMW/lib/python3.8/site-packages/ctranslate2/converters/converter.py", line 45, in convert
    model_spec = self._load()
  File "/root/.local/share/virtualenvs/r10-MnTosGMW/lib/python3.8/site-packages/ctranslate2/converters/fairseq.py", line 94, in _load
    model_spec = _get_model_spec(args)
  File "/root/.local/share/virtualenvs/r10-MnTosGMW/lib/python3.8/site-packages/ctranslate2/converters/fairseq.py", line 61, in _get_model_spec
    utils.raise_unsupported(reasons)
  File "/root/.local/share/virtualenvs/r10-MnTosGMW/lib/python3.8/site-packages/ctranslate2/converters/utils.py", line 16, in raise_unsupported
    raise ValueError(message)
ValueError: The model you are trying to convert is not supported by CTranslate2. We identified the following reasons:

  • Option --arch transformer_wmt_en_de_big_pipeline_parallel is not supported (supported architectures are: transformer_wmt_en_de_big, transformer_tiny, transformer_vaswani_wmt_en_fr_big, transformer_wmt_en_de, transformer, transformer_vaswani_wmt_en_de_big, transformer_iwslt_de_en, transformer_wmt_en_de_big_t2t)

Yes, the 12B version uses a different model architecture so that it can be distributed across several GPUs.

For now it is not supported, but I will check how CTranslate2 handles these gigantic models. A 48GB model is several times bigger than anything we have ever tested.

Is it possible to convert the 12B model architecture to run on a single GPU?

I think the quantized 12B model would only barely run on a 16GB GPU. But before that, there are other issues to address; for example, the converter currently requires about 2 times the model size in memory to run.
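As a rough back-of-the-envelope: 12B parameters at 4 bytes each in float32 is about 48GB, so conversion would need on the order of 96GB of RAM; int8 quantization divides the model size by roughly 4, down to about 12GB, which is why it would only barely fit on a 16GB GPU.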