OpenNMT-tf: is it possible to train a Transformer model with case factors (a.k.a. case features)?

Hi everyone,
After reading the latest announcements on source factors (a.k.a. case features), I'm wondering whether it is now possible to train a Transformer model with OpenNMT-tf using source/target case annotations, similar to how it worked with the former OpenNMT-lua, e.g. cat [src|tgt] | tokenizer --case_feature […]
If so, would it also be possible to train it with a subword model, e.g. SentencePiece?
Thanks for any answer that makes things clearer.


Hi,

Only source factors/features are currently implemented in OpenNMT-tf, OpenNMT-py, and CTranslate2, so you cannot use target case annotations the way you did in OpenNMT-lua.

The current alternative is the tokenizer option case_markup which can be used for both the source and the target:

$ echo "Hello World!" | cli/tokenize --case_markup --soft_case_regions
⦅mrk_case_modifier_C⦆ hello ⦅mrk_case_modifier_C⦆ world ■!

$ echo "⦅mrk_case_modifier_C⦆ hello ⦅mrk_case_modifier_C⦆ world ■!" | cli/detokenize 
Hello World!

This approach works well and does not require any model changes.
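The idea behind the markup can be sketched in a few lines of plain Python. This is a toy illustration of the scheme, not the actual OpenNMT Tokenizer implementation; only the marker string mirrors the real output above:

```python
# Toy sketch of the case_markup idea: each token is lowercased, and a
# modifier marker is inserted before tokens that started with an
# uppercase letter so the case can be restored at detokenization time.

MODIFIER = "⦅mrk_case_modifier_C⦆"

def encode_case(tokens):
    """Lowercase tokens, inserting a marker before capitalized ones."""
    out = []
    for tok in tokens:
        if tok[:1].isupper():
            out.append(MODIFIER)
        out.append(tok.lower())
    return out

def decode_case(tokens):
    """Re-apply the capitalization signaled by the markers."""
    out = []
    capitalize_next = False
    for tok in tokens:
        if tok == MODIFIER:
            capitalize_next = True
            continue
        out.append(tok.capitalize() if capitalize_next else tok)
        capitalize_next = False
    return out

encoded = encode_case(["Hello", "World", "!"])
# ["⦅mrk_case_modifier_C⦆", "hello", "⦅mrk_case_modifier_C⦆", "world", "!"]
decoded = decode_case(encoded)
# ["Hello", "World", "!"]
```

Because the markers are ordinary (protected) tokens, the model just learns to emit them like any other vocabulary entry, and detokenization restores the casing.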


Hi Guillaume,
Thanks for your immediate reply. By the way: great improvements, thank you!
Ah, I see: the markup is protected, so it works transparently.
When dealing with a subword model, e.g. SentencePiece: is it better to split first and apply the markup second? IMHO that should work better in combination with tokenize --segment_case - or am I missing something?

As you noticed, case_markup requires segment_case, so the latter is automatically enabled:

echo "WiFi" | cli/tokenize --case_markup --soft_case_regions --joiner_annotate
⦅mrk_case_modifier_C⦆ wi■ ⦅mrk_case_modifier_C⦆ fi

Ideally, the subword model should be trained and applied after this case markup tokenization, so that the subword model is case-insensitive. BPE always requires a pretokenization, so this is how it works anyway; SentencePiece, however, is usually applied to the whole sentence. In that case we suggest training SentencePiece with a pretokenization, like BPE.
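A toy illustration of why this helps (plain Python, not SentencePiece itself; the tiny corpus is made up): once case is factored out into markers, the subword learner sees only one surface form per word, so its symbol inventory shrinks.

```python
# Compare the word types a subword learner would see on a cased corpus
# versus the same corpus after case has been factored out into markers.

def word_types(corpus, case_factored=False):
    """Collect the set of distinct word forms in the corpus."""
    types = set()
    for line in corpus:
        for tok in line.split():
            types.add(tok.lower() if case_factored else tok)
    return types

corpus = ["Hello world", "HELLO WORLD", "hello world"]

cased = word_types(corpus)                      # Hello, HELLO, hello, ...
uncased = word_types(corpus, case_factored=True)  # hello, world

# With case factored out, one vocabulary entry covers all casings.
assert len(uncased) < len(cased)
```

The same argument applies to subword units: without case markup, SentencePiece would have to spend vocabulary slots on "Hello", "HELLO", and "hello" separately.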

You can achieve that when training the subword models with the OpenNMT Tokenizer: Tokenizer/bindings/python at master · OpenNMT/Tokenizer · GitHub

Applying case_markup with SentencePiece on the full sentence is tracked as an open issue:

Perfect answer - thanks!
Cheers, Martin
