Character tokenizer with TF2 version

cservan · March 17, 2020, 9:16am

Hello,
With the TensorFlow 1 version, I used to launch vocabulary extraction with the character tokenizer like this:

onmt-build-vocab --size 48000 --save_vocab vocab_char.src train.src --tokenizer CharacterTokenizer

but now I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/onmt-build-vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/opennmt/bin/build_vocab.py", line 73, in main
    tokenizer = tokenizers.make_tokenizer(args.tokenizer_config)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/tokenizers/tokenizer.py", line 297, in make_tokenizer
    raise ValueError("Invalid tokenization configuration: %s" % str(config))
ValueError: Invalid tokenization configuration: CharacterTokenizer

It seems I am missing something. Can you help me with this issue, please?

guillaumekln · March 17, 2020, 9:51am

Hi Christophe!

Yes, this has slightly changed. For now, --tokenizer_config expects a path to a tokenization configuration. In your case you could just do:

echo "type: CharacterTokenizer" > char_tokenization.yml
onmt-build-vocab --size 48000 --save_vocab vocab_char.src --tokenizer_config char_tokenization.yml train.src

See also https://opennmt.net/OpenNMT-tf/tokenization.html
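If you prefer to script these two steps (for example inside a preprocessing pipeline), here is a minimal Python sketch. It only writes the same one-line configuration and assembles the same command as above; the helper names `write_char_tokenizer_config`, `build_vocab_command` and `main` are made up for illustration and are not part of OpenNMT-tf. Running the command itself still requires OpenNMT-tf to be installed, so that step is off by default.

```python
import subprocess
from pathlib import Path


def write_char_tokenizer_config(path):
    """Write the one-line YAML config (same content as the echo command above)."""
    path = Path(path)
    path.write_text("type: CharacterTokenizer\n", encoding="utf-8")
    return path


def build_vocab_command(config_path, train_file, vocab_file, size=48000):
    """Assemble the onmt-build-vocab call using only the flags shown in this thread."""
    return [
        "onmt-build-vocab",
        "--size", str(size),
        "--save_vocab", str(vocab_file),
        "--tokenizer_config", str(config_path),
        str(train_file),
    ]


def main(run=False):
    # Hypothetical driver: file names are the ones used in this thread.
    config = write_char_tokenizer_config("char_tokenization.yml")
    command = build_vocab_command(config, "train.src", "vocab_char.src")
    print(" ".join(command))
    if run:
        # Only works where OpenNMT-tf (and thus onmt-build-vocab) is installed.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    main(run=False)
```

Calling `main(run=True)` would actually launch the vocabulary build; with the default it just prints the command so you can check it first.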

We could make this easier for simple tokenizers that don’t have any parameters.

cservan · March 17, 2020, 10:53am

Hi Guillaume,
It worked perfectly, thanks!

Indeed, it would be simpler if tokenizers without parameters could be selected directly.