Character tokenizer with the TF2 version

With the TensorFlow 1 version, I used to run the vocabulary extraction with the character tokenizer like this:

onmt-build-vocab --size 48000 --save_vocab vocab_char.src train.src --tokenizer CharacterTokenizer

But now I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/onmt-build-vocab", line 8, in
  File "/usr/local/lib/python3.6/dist-packages/opennmt/bin/", line 73, in main
    tokenizer = tokenizers.make_tokenizer(args.tokenizer_config)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/tokenizers/", line 297, in make_tokenizer
    raise ValueError("Invalid tokenization configuration: %s" % str(config))
ValueError: Invalid tokenization configuration: CharacterTokenizer

It seems I am missing something. Can you help me with this issue, please?

Hi Christophe!

Yes, this has slightly changed. For now, --tokenizer_config expects a path to a tokenization configuration. In your case you could just do:

echo "type: CharacterTokenizer" > char_tokenization.yml
onmt-build-vocab --size 48000 --save_vocab vocab_char.src --tokenizer_config char_tokenization.yml train.src
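
To give a rough idea of what a character-level vocabulary looks like, here is a minimal sketch in plain Python. It is not the OpenNMT-tf implementation: the function name build_char_vocab is hypothetical, replacing spaces with a visible "▁" marker is an assumption about how spaces are kept, and special tokens are left out entirely.

```python
from collections import Counter


def build_char_vocab(lines, size):
    """Return up to `size` characters, most frequent first (illustrative only)."""
    counts = Counter()
    for line in lines:
        # Assumption: spaces are turned into a visible marker so they are
        # still represented after splitting the line into characters.
        counts.update(line.strip().replace(" ", "▁"))
    # Keep only the `size` most frequent characters.
    return [char for char, _ in counts.most_common(size)]


vocab = build_char_vocab(["hello world", "hello there"], size=4)
print(vocab)
```

With a character tokenizer the vocabulary is usually much smaller than a word-level one, so a limit like --size 48000 would probably only matter for corpora with very large character sets.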

We could make this easier for simple tokenizers that don’t have any parameters.

Hi Guillaume,
It worked perfectly, thanks!

Indeed, it would be simpler that way :wink: