Convert small100 with ctranslate2

Hello, I have seen this model on Hugging Face: alirezamsh/small100 · Hugging Face. Is it m2m100 12B distilled with the same quality? I tried to convert it with CTranslate2 using this command:

ct2-transformers-converter --model alirezamsh/small100 --output_dir m2m100_418
and I got this error:

2022-12-27 22:44:39.167109: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-27 22:44:39.937672: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-12-27 22:44:39.937756: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-12-27 22:44:39.937766: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/jourdelune/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:500: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jourdelune/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
/home/jourdelune/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:500: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  setattr(self, word, getattr(machar, word).flat[0])
/home/jourdelune/.local/lib/python3.10/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
  return self._float_to_str(self.smallest_subnormal)
Traceback (most recent call last):
  File "/usr/local/sbin/ct2-transformers-converter", line 8, in <module>
    sys.exit(main())
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/converters/transformers.py", line 539, in main
    converter.convert_from_args(args)
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
    return self.convert(
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 97, in convert
    model_spec.validate()
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 458, in validate
    super().validate()
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 107, in validate
    self._visit(_check)
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 219, in _visit
    visit_spec(self, fn)
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 45, in visit_spec
    visit_spec(value, fn, scope=_join_scope(scope, name))
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 43, in visit_spec
    visit_spec(elem, fn, scope=_join_scope(scope, "%s_%d" % (name, i)))
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 45, in visit_spec
    visit_spec(value, fn, scope=_join_scope(scope, name))
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 45, in visit_spec
    visit_spec(value, fn, scope=_join_scope(scope, name))
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 47, in visit_spec
    fn(spec, _join_scope(scope, name), value)
  File "/home/jourdelune/.local/lib/python3.10/site-packages/ctranslate2/specs/model_spec.py", line 89, in _check
    raise ValueError("Missing value for attribute %s" % name)
ValueError: Missing value for attribute decoder/layer_3/self_attention/layer_norm/gamma

Does anyone know why I have this issue? Normally it’s essentially the same architecture as m2m100.

Sincerely.


Hi,

The converter does not handle configurations where the encoder and decoder have different numbers of layers.

It will be fixed in the next version.

EDIT: we just released the fix in version 3.3.0.
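Since the fix only shipped in 3.3.0, it can be worth gating the conversion on the installed version. A minimal sketch in pure Python (the helper name is hypothetical; at runtime you would pass it `ctranslate2.__version__`):

```python
def supports_asymmetric_layers(version: str) -> bool:
    """Return True if this CTranslate2 version includes the fix for
    models whose encoder and decoder layer counts differ (3.3.0+)."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return (major, minor) >= (3, 3)

print(supports_asymmetric_layers("3.2.0"))  # False
print(supports_asymmetric_layers("3.3.0"))  # True
```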


Oh nice! I retried, but I get the same error :frowning:
I ran:
pip install ctranslate2
which produced:

Installing collected packages: ctranslate2
  WARNING: The scripts ct2-fairseq-converter, ct2-marian-converter, ct2-openai-gpt2-converter, ct2-opennmt-py-converter, ct2-opennmt-tf-converter, ct2-opus-mt-converter and ct2-transformers-converter are installed in '/home/jourdelune/.local/bin' which is not on PATH.
  Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed ctranslate2-3.3.0

then cd /home/jourdelune/.local/bin
and finally ran the command:
ct2-transformers-converter --model alirezamsh/small100 --output_dir /home/jourdelune/Documents/model/
which returns the same error:

ValueError: Missing value for attribute decoder/layer_3/self_attention/layer_norm/gamma

Oops, sorry, it works fine! (I just hadn’t picked the ct2-transformers-converter from the local bin directory.)

However, the converted model doesn’t seem to work. Is it a problem with the model, with CTranslate2, or with my usage?

# wget https://huggingface.co/alirezamsh/small100/raw/main/tokenization_small100.py
# ct2-transformers-converter --model alirezamsh/small100 --output_dir model
from tokenization_small100 import SMALL100Tokenizer
import ctranslate2


translator = ctranslate2.Translator("model")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100")
tokenizer.src_lang = "en"

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world, it's a small test"))
target_prefix = [tokenizer.lang_code_to_token["fr"]]
results = translator.translate_batch([source], target_prefix=[target_prefix])
target = results[0].hypotheses[0][1:]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))

Result:

CT2_VERBOSE=1 python trans.py
[2023-01-20 07:19:22.476] [ctranslate2] [thread 22087] [info] CPU: GenuineIntel (SSE4.1=true, AVX=true, AVX2=true)
[2023-01-20 07:19:22.476] [ctranslate2] [thread 22087] [info]  - Selected ISA: AVX2
[2023-01-20 07:19:22.476] [ctranslate2] [thread 22087] [info]  - Use Intel MKL: true
[2023-01-20 07:19:22.476] [ctranslate2] [thread 22087] [info]  - SGEMM backend: MKL
[2023-01-20 07:19:22.476] [ctranslate2] [thread 22087] [info]  - GEMM_S16 backend: MKL
[2023-01-20 07:19:22.476] [ctranslate2] [thread 22087] [info]  - GEMM_S8 backend: MKL (u8s8 preferred: true)
[2023-01-20 07:19:22.476] [ctranslate2] [thread 22087] [info]  - Use packed GEMM: false
[2023-01-20 07:19:22.972] [ctranslate2] [thread 22087] [info] Loaded model model on device cpu:0
[2023-01-20 07:19:22.972] [ctranslate2] [thread 22087] [info]  - Binary version: 6
[2023-01-20 07:19:22.972] [ctranslate2] [thread 22087] [info]  - Model specification revision: 6
[2023-01-20 07:19:22.972] [ctranslate2] [thread 22087] [info]  - Selected compute type: float
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'M2M100Tokenizer'. 
The class this function is called from is 'SMALL100Tokenizer'.
ku на, it's a small test

It seems the input should be prepared differently from M2M100. In particular, the target prefix should not be used: the tokenizer includes the target language in the source input.

from tokenization_small100 import SMALL100Tokenizer
import ctranslate2

translator = ctranslate2.Translator("model")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100")
tokenizer.src_lang = "en"
tokenizer.tgt_lang = "fr"

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world, it's a small test"))
results = translator.translate_batch([source])
target = results[0].hypotheses[0]

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
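To make the difference concrete, here is a sketch of the two input layouts side by side. The token strings are purely illustrative (assuming M2M100-style `__fr__` language-code tokens), not actual tokenizer output:

```python
# M2M100-style: the language code is supplied as a decoder target prefix.
m2m100_source = ["__en__", "▁Hello", "▁world", "</s>"]
m2m100_target_prefix = ["__fr__"]

# SMALL100-style: the tokenizer puts the *target* language code in the
# source sequence itself, and no target prefix is passed to translate_batch.
small100_source = ["__fr__", "▁Hello", "▁world", "</s>"]
small100_target_prefix = None

print(small100_source[0])  # __fr__
```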

Note that the model is also impacted by this conversion issue: m2m100 model generates gibberish after converted by ctranslate2 with latest transformers library · Issue #1039 · OpenNMT/CTranslate2 · GitHub

You should install transformers==4.23.* before using ct2-transformers-converter.
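If you script the conversion, you can also guard against an incompatible transformers version before converting. A minimal pure-Python sketch (the helper name is hypothetical; at runtime you would pass it `importlib.metadata.version("transformers")`):

```python
def is_pinned_transformers(version: str) -> bool:
    """True if the version string is in the 4.23 series that the
    workaround above pins to."""
    return version.split(".")[:2] == ["4", "23"]

print(is_pinned_transformers("4.23.1"))  # True
print(is_pinned_transformers("4.25.0"))  # False
```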


Thanks! This indeed solves the problem and seems to work:

[2023-01-20 14:37:25.423] [ctranslate2] [thread 14741] [info] CPU: GenuineIntel (SSE4.1=true, AVX=true, AVX2=true)
[2023-01-20 14:37:25.423] [ctranslate2] [thread 14741] [info]  - Selected ISA: AVX2
[2023-01-20 14:37:25.423] [ctranslate2] [thread 14741] [info]  - Use Intel MKL: true
[2023-01-20 14:37:25.423] [ctranslate2] [thread 14741] [info]  - SGEMM backend: MKL
[2023-01-20 14:37:25.423] [ctranslate2] [thread 14741] [info]  - GEMM_S16 backend: MKL
[2023-01-20 14:37:25.423] [ctranslate2] [thread 14741] [info]  - GEMM_S8 backend: MKL (u8s8 preferred: true)
[2023-01-20 14:37:25.423] [ctranslate2] [thread 14741] [info]  - Use packed GEMM: false
[2023-01-20 14:37:26.004] [ctranslate2] [thread 14741] [info] Loaded model model on device cpu:0
[2023-01-20 14:37:26.004] [ctranslate2] [thread 14741] [info]  - Binary version: 6
[2023-01-20 14:37:26.004] [ctranslate2] [thread 14741] [info]  - Model specification revision: 6
[2023-01-20 14:37:26.004] [ctranslate2] [thread 14741] [info]  - Selected compute type: float
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'M2M100Tokenizer'. 
The class this function is called from is 'SMALL100Tokenizer'.
Bonjour, c'est un petit test