Failing conversion of Small100 (SMALL100Tokenizer does not exist or is not currently imported)

Hello!
This issue is vaguely similar to this one, but I’m getting a different error message, and I think I’m overlooking something very simple, so apologies in advance.

I’ve fine-tuned Small100 for a specific translation task and updated transformers and CTranslate2 to their latest versions, but when running:

ct2-transformers-converter --model "\Small100" --output_dir "\CT-Small100"

I’m getting the following exception:

Traceback (most recent call last):
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\Lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\Lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\Scripts\ct2-transformers-converter.exe\__main__.py", line 7, in <module>
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\lib\site-packages\ctranslate2\converters\transformers.py", line 1610, in main
    converter.convert_from_args(args)
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\lib\site-packages\ctranslate2\converters\converter.py", line 57, in convert_from_args
    return self.convert(
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\lib\site-packages\ctranslate2\converters\converter.py", line 96, in convert
    model_spec = self._load()
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\lib\site-packages\ctranslate2\converters\transformers.py", line 121, in _load
    tokenizer = self.load_tokenizer(
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\lib\site-packages\ctranslate2\converters\transformers.py", line 143, in load_tokenizer
    return tokenizer_class.from_pretrained(model_name_or_path, **kwargs)
  File "C:\Users\Cadenza\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 699, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class SMALL100Tokenizer does not exist or is not currently imported.

Sorry again if it’s something very simple, but any help would be appreciated!

Thanks in advance,

Hi,

Are all tokenizer related configurations and files included in the fine-tuned model directory?

Hi!
Thanks for your answer!

As far as I can tell, I believe so. The files in this folder are:

- config.json
- pytorch_model.bin
- sentencepiece.bpe.model
- special_tokens_map.json
- tokenizer_config.json
- training_args.bin
- vocab.json

Am I missing something?

The base model also has a file named tokenization_small100.py.

Indeed, and I have it! I’m just unsure how to pass it to the conversion tool.
Could you help me with this ?
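For context, Transformers has a general mechanism for shipping a custom tokenizer class alongside a model: an auto_map entry in tokenizer_config.json, which AutoTokenizer follows when the model is loaded with trust_remote_code=True (recent CTranslate2 converter releases expose a --trust_remote_code flag; check your installed version). Whether the Small100 repository actually defines such an entry is not confirmed here; the fragment below is only a sketch of what one could look like:

```json
{
  "tokenizer_class": "SMALL100Tokenizer",
  "auto_map": {
    "AutoTokenizer": ["tokenization_small100.SMALL100Tokenizer", null]
  }
}
```

With an entry like this in the fine-tuned model directory (next to tokenization_small100.py), AutoTokenizer would import the named class from that file instead of searching its built-in registry.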

The CTranslate2 converter simply loads the tokenizer with transformers.AutoTokenizer.from_pretrained.

But I don’t know how to register custom tokenizers with AutoTokenizer. I suggest asking this question on the Transformers forum:
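For what it’s worth, one workaround reported for similar custom-tokenizer models is to run the conversion from Python and first attach the class to the transformers namespace, since AutoTokenizer falls back to looking an unknown tokenizer class name up as an attribute of the top-level transformers module. The lookup mechanics can be sketched without either package installed; the module stub and names below are illustrative stand-ins, not the real API:

```python
# Sketch of the name-lookup fallback behind AutoTokenizer, using a
# stand-in module object instead of the real transformers package.
import types

# Stand-in for the transformers package namespace.
fake_transformers = types.ModuleType("transformers_stub")

class SMALL100Tokenizer:
    """Stand-in for the class defined in tokenization_small100.py."""

# "Registering" the class amounts to attaching it under the name that
# tokenizer_config.json declares as tokenizer_class.
setattr(fake_transformers, "SMALL100Tokenizer", SMALL100Tokenizer)

# The resolution step AutoTokenizer performs is equivalent to:
resolved = getattr(fake_transformers, "SMALL100Tokenizer", None)
print(resolved is SMALL100Tokenizer)  # prints True
```

With the real packages installed, the equivalent step would be importing SMALL100Tokenizer from tokenization_small100.py and assigning it as an attribute of the transformers module before building a ctranslate2.converters.TransformersConverter; treat that as an untested suggestion, not a documented API.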

Alright, thank you very much for the pointers!