OpenNMT-tf: is it possible to train a Transformer model with case factors (a.k.a. case features)?

Hi everyone,
After reading the latest announcements on source factors (a.k.a. case features), I'm wondering whether it is now possible to train a Transformer model with OpenNMT-tf using source/target case annotations, similar to how it worked with the former OpenNMT-lua, e.g. cat [src|tgt] | tokenizer --case_feature […]
If so, would it also be possible to train it with a subword model, e.g. SentencePiece?
Thanks for any answer that makes things clearer.


Hi,

Only source factors/features are currently implemented in OpenNMT-tf, OpenNMT-py, and CTranslate2, so you cannot use target case annotations the way you did in OpenNMT-lua.

The current alternative is the tokenizer option case_markup which can be used for both the source and the target:

$ echo "Hello World!" | cli/tokenize --case_markup --soft_case_regions
⦅mrk_case_modifier_C⦆ hello ⦅mrk_case_modifier_C⦆ world ■!

$ echo "⦅mrk_case_modifier_C⦆ hello ⦅mrk_case_modifier_C⦆ world ■!" | cli/detokenize 
Hello World!

This approach works well and does not require any model changes.
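The idea behind the markup can be sketched in a few lines of plain Python. This is a toy illustration of the scheme, not the actual OpenNMT Tokenizer implementation; only the marker string mirrors the real output above:

```python
# Toy sketch of the case_markup idea: each token is lowercased, and a
# modifier marker is inserted before tokens that started with an
# uppercase letter so the case can be restored at detokenization time.

MODIFIER = "⦅mrk_case_modifier_C⦆"

def encode_case(tokens):
    """Lowercase tokens, inserting a marker before capitalized ones."""
    out = []
    for tok in tokens:
        if tok[:1].isupper():
            out.append(MODIFIER)
        out.append(tok.lower())
    return out

def decode_case(tokens):
    """Re-apply the capitalization signaled by the markers."""
    out = []
    capitalize_next = False
    for tok in tokens:
        if tok == MODIFIER:
            capitalize_next = True
            continue
        out.append(tok.capitalize() if capitalize_next else tok)
        capitalize_next = False
    return out

encoded = encode_case(["Hello", "World", "!"])
# ["⦅mrk_case_modifier_C⦆", "hello", "⦅mrk_case_modifier_C⦆", "world", "!"]
decoded = decode_case(encoded)
# ["Hello", "World", "!"]
```

Because the markers are ordinary (protected) tokens, the model just learns to emit them like any other vocabulary entry, and detokenization restores the casing.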


Hi Guillaume,
Thanks for your immediate reply. By the way: great improvements, thank you!
Ah, I see: the markup is protected, so it works transparently.
When dealing with a subword model, e.g. SentencePiece: is it better to split first and apply the markup second? IMHO that should work better in combination with tokenize --segment_case - or am I missing something?

As you noticed, case_markup requires segment_case, so the latter is automatically enabled:

echo "WiFi" | cli/tokenize --case_markup --soft_case_regions --joiner_annotate
⦅mrk_case_modifier_C⦆ wi■ ⦅mrk_case_modifier_C⦆ fi

Ideally, the subword model should be trained and applied after this case markup tokenization, so that the subword model is case-insensitive. BPE always requires a pretokenization, so this is how it works anyway; SentencePiece, however, is usually applied to the whole sentence. In that case we suggest training SentencePiece with a pretokenization, like BPE.
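A toy illustration of why this helps (plain Python, not SentencePiece itself; the tiny corpus is made up): once case is factored out into markers, the subword learner sees only one surface form per word, so its symbol inventory shrinks.

```python
# Compare the word types a subword learner would see on a cased corpus
# versus the same corpus after case has been factored out into markers.

def word_types(corpus, case_factored=False):
    """Collect the set of distinct word forms in the corpus."""
    types = set()
    for line in corpus:
        for tok in line.split():
            types.add(tok.lower() if case_factored else tok)
    return types

corpus = ["Hello world", "HELLO WORLD", "hello world"]

cased = word_types(corpus)                      # Hello, HELLO, hello, ...
uncased = word_types(corpus, case_factored=True)  # hello, world

# With case factored out, one vocabulary entry covers all casings.
assert len(uncased) < len(cased)
```

The same argument applies to subword units: without case markup, SentencePiece would have to spend vocabulary slots on "Hello", "HELLO", and "hello" separately.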

You can achieve that when training the subword models with the OpenNMT Tokenizer: Tokenizer/bindings/python at master · OpenNMT/Tokenizer · GitHub

Applying case_markup with SentencePiece on the full sentence is tracked as an open issue:

Perfect answer - thanks!
Cheers, Martin
