Yes, thanks for the heads-up.
However, I can’t get it to work. Here is what I do:
To use on-the-fly tokenization, I trained the SentencePiece model externally and referenced it in the configuration:
File data.yml:
train_features_file: path/src-train.txt
train_labels_file: path/tgt-train.txt
eval_features_file: path/src-val.txt
eval_labels_file: path/tgt-val.txt
source_tokenization: path/tok.yml
target_tokenization: path/tok.yml
and
File tok.yml:
source_tokenization:
  mode: none
  sp_model_path: path/sp.model
target_tokenization:
  mode: none
  sp_model_path: path/sp.model
(I don’t know if it makes sense to set joiner_annotate: true and case_feature: true when using SentencePiece.)
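For what it’s worth, the SP model itself can be sanity-checked directly with the sentencepiece Python package — a minimal sketch, with a made-up sample sentence:

import sentencepiece as spm

# Load the externally trained model and print the pieces for a sample sentence.
sp = spm.SentencePieceProcessor()
sp.Load("path/sp.model")
print(sp.EncodeAsPieces("Hello world!"))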
And when I try to build the vocabularies:
onmt-build-vocab --tokenizer_config path/tok.yml --size 50000 --save_vocab path/src-vocab.txt path/src-train.txt
I get the error:
Traceback (most recent call last):
  File "/usr/local/bin/onmt-build-vocab", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/opennmt/bin/build_vocab.py", line 40, in main
    tokenizer = tokenizers.build_tokenizer(args)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/tokenizers/__init__.py", line 38, in build_tokenizer
    return tokenizer_class(configuration_file_or_key=args.tokenizer_config)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/tokenizers/opennmt_tokenizer.py", line 36, in __init__
    self._tokenizer = create_tokenizer(self._config)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/tokenizers/opennmt_tokenizer.py", line 28, in create_tokenizer
    return pyonmttok.Tokenizer(mode, **kwargs)
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. pyonmttok.Tokenizer(mode: str, bpe_model_path: str='', bpe_vocab_path: str='', bpe_vocab_threshold: int=50, vocabulary_path: str='', vocabulary_threshold: int=0, sp_model_path: str='', sp_nbest_size: int=0, sp_alpha: float=0.1, joiner: str='￭', joiner_annotate: bool=False, joiner_new: bool=False, spacer_annotate: bool=False, spacer_new: bool=False, case_feature: bool=False, case_markup: bool=False, no_substitution: bool=False, preserve_placeholders: bool=False, preserve_segmented_tokens: bool=False, segment_case: bool=False, segment_numbers: bool=False, segment_alphabet_change: bool=False, segment_alphabet: list=[])
Invoked with: 'conservative'; kwargs: source_tokenization={'mode': 'none', 'sp_model_path': 'path/sp.model'}, target_tokenization={'mode': 'none', 'sp_model_path': 'path/sp.model'}
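If I read the error correctly, the whole source_tokenization / target_tokenization mapping from my tok.yml is being passed as keyword arguments to pyonmttok.Tokenizer, which only accepts the flat options listed in the message. A minimal sketch of what the constructor itself seems to expect (assuming pyonmttok is installed; the sample sentence is made up):

import pyonmttok

# The constructor takes the mode plus flat keyword options, not nested mappings.
tokenizer = pyonmttok.Tokenizer("none", sp_model_path="path/sp.model")
tokens, features = tokenizer.tokenize("Hello world!")
print(tokens)

So maybe the nesting in my tok.yml is the problem?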
As an alternative, I also tried running the tokenization offline, i.e. first tokenizing the training data and then building the vocabularies, but I am not sure how to do this:
Is it with onmt-tokenize-text? If yes, how do I pass the files?
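My best guess, copying the flag from onmt-build-vocab and piping the files through stdin/stdout, would be something like:

onmt-tokenize-text --tokenizer_config path/tok.yml < path/src-train.txt > path/src-train.txt.tok

but I have not been able to confirm this is the intended usage.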
Thanks in advance,