Yes, thanks for the heads-up.
However, I can’t get it to work. Here is what I do:
To use on-the-fly tokenization, I trained the SentencePiece model externally and referenced it in the configuration:
File data.yml:
train_features_file: path/src-train.txt
train_labels_file: path/tgt-train.txt
eval_features_file: path/src-val.txt
eval_labels_file: path/tgt-val.txt
source_tokenization: path/tok.yml
target_tokenization: path/tok.yml
and
File tok.yml:
source_tokenization:
  mode: none
  sp_model_path: path/sp.model
target_tokenization:
  mode: none
  sp_model_path: path/sp.model
(I don’t know if it makes sense to set joiner_annotate: true and case_feature: true when using SentencePiece.)
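For what it’s worth, the SP model itself can be sanity-checked directly with the sentencepiece Python package — a minimal sketch, with a made-up sample sentence:

import sentencepiece as spm

# Load the externally trained model and print the pieces for a sample sentence.
sp = spm.SentencePieceProcessor()
sp.Load("path/sp.model")
print(sp.EncodeAsPieces("Hello world!"))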
And when I try to build the vocabularies:
onmt-build-vocab --tokenizer_config path/tok.yml --size 50000 --save_vocab path/src-vocab.txt path/src-train.txt
I get the error:
Traceback (most recent call last):
  File "/usr/local/bin/onmt-build-vocab", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/opennmt/bin/build_vocab.py", line 40, in main
    tokenizer = tokenizers.build_tokenizer(args)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/tokenizers/__init__.py", line 38, in build_tokenizer
    return tokenizer_class(configuration_file_or_key=args.tokenizer_config)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/tokenizers/opennmt_tokenizer.py", line 36, in __init__
    self._tokenizer = create_tokenizer(self._config)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/tokenizers/opennmt_tokenizer.py", line 28, in create_tokenizer
    return pyonmttok.Tokenizer(mode, **kwargs)
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. pyonmttok.Tokenizer(mode: str, bpe_model_path: str='', bpe_vocab_path: str='', bpe_vocab_threshold: int=50, vocabulary_path: str='', vocabulary_threshold: int=0, sp_model_path: str='', sp_nbest_size: int=0, sp_alpha: float=0.1, joiner: str='￭', joiner_annotate: bool=False, joiner_new: bool=False, spacer_annotate: bool=False, spacer_new: bool=False, case_feature: bool=False, case_markup: bool=False, no_substitution: bool=False, preserve_placeholders: bool=False, preserve_segmented_tokens: bool=False, segment_case: bool=False, segment_numbers: bool=False, segment_alphabet_change: bool=False, segment_alphabet: list=[])
Invoked with: 'conservative'; kwargs: source_tokenization={'mode': 'none', 'sp_model_path': 'path/sp.model'}, target_tokenization={'mode': 'none', 'sp_model_path': 'path/sp.model'}
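If I read the error correctly, the whole source_tokenization / target_tokenization mapping from my tok.yml is being passed as keyword arguments to pyonmttok.Tokenizer, which only accepts the flat options listed in the message. A minimal sketch of what the constructor itself seems to expect (assuming pyonmttok is installed; the sample sentence is made up):

import pyonmttok

# The constructor takes the mode plus flat keyword options, not nested mappings.
tokenizer = pyonmttok.Tokenizer("none", sp_model_path="path/sp.model")
tokens, features = tokenizer.tokenize("Hello world!")
print(tokens)

So maybe the nesting in my tok.yml is the problem?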
As an alternative, I also tried running the tokenization offline, i.e. first tokenizing the training data and then building the vocabularies, but I am not sure how to do this:
Is it with onmt-tokenize-text? If yes, how do I pass the files?
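My best guess, copying the flag from onmt-build-vocab and piping the files through stdin/stdout, would be something like:

onmt-tokenize-text --tokenizer_config path/tok.yml < path/src-train.txt > path/src-train.txt.tok

but I have not been able to confirm this is the intended usage.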
Thanks in advance,