Hi, everyone,
After reading the latest announcements on source factors (a.k.a. case features), I’m wondering whether it is now possible to train a Transformer model with OpenNMT-tf using source/target case annotations, similar to how it worked with the former OpenNMT-lua, e.g. cat [src|tgt] | tokenizer --case_feature […]
If so, would it even be possible to train it with a subword model, e.g. SentencePiece?
Thanks for any answer that makes things clearer.
Hi,
Only source factors/features are currently implemented in OpenNMT-tf, OpenNMT-py, and CTranslate2. So you cannot use target case annotations as you used in OpenNMT-lua.
The current alternative is the tokenizer option case_markup, which can be used for both the source and the target:
$ echo "Hello World!" | cli/tokenize --case_markup --soft_case_regions
⦅mrk_case_modifier_C⦆ hello ⦅mrk_case_modifier_C⦆ world ■!
$ echo "⦅mrk_case_modifier_C⦆ hello ⦅mrk_case_modifier_C⦆ world ■!" | cli/detokenize
Hello World!
This approach works well and does not require any model changes.
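To make the idea concrete, here is a toy sketch of the round trip in plain Python. It is not the OpenNMT Tokenizer implementation: the function names are hypothetical, it only handles tokens whose first letter is uppercase and the rest lowercase, and it ignores joiners and case regions. It only shows why the model itself needs no change: the case information travels as an ordinary token in the sequence.

```python
# Marker string copied from the tokenizer output quoted above.
CAPITAL_MARK = "⦅mrk_case_modifier_C⦆"


def toy_case_markup(tokens):
    """Lowercase capitalized tokens and put a marker token before each one."""
    out = []
    for tok in tokens:
        # Assumption: only the "Capitalized" pattern is handled here.
        if tok[:1].isupper() and tok[1:] == tok[1:].lower():
            out.append(CAPITAL_MARK)
            out.append(tok.lower())
        else:
            out.append(tok)
    return out


def toy_case_restore(tokens):
    """Inverse step: capitalize the token that follows each marker."""
    out = []
    capitalize_next = False
    for tok in tokens:
        if tok == CAPITAL_MARK:
            capitalize_next = True
        elif capitalize_next:
            out.append(tok[:1].upper() + tok[1:])
            capitalize_next = False
        else:
            out.append(tok)
    return out


marked = toy_case_markup(["Hello", "World", "!"])
restored = toy_case_restore(marked)
print(marked)
print(restored)
```

Mixed-case tokens such as "WiFi" are left untouched by this sketch; the real tokenizer handles them by segmenting on case changes.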
Hi Guillaume,
Thanks for your immediate reply. BTW: great improvements, thank you!
Aah, I see: the markup is protected, so it works transparently.
When dealing with a subword model, e.g. SentencePiece: is it better to split first and apply the markup second? IMHO that should work better when used with tokenize --segment_case, or am I missing something?
As you noticed, case_markup requires segment_case, so it is automatically enabled:
$ echo "WiFi" | cli/tokenize --case_markup --soft_case_regions --joiner_annotate
⦅mrk_case_modifier_C⦆ wi■ ⦅mrk_case_modifier_C⦆ fi
Ideally, the subword model should be trained and applied after this case markup tokenization so that the subword model is case insensitive. BPE always requires a pretokenization, so this is how it works anyway, but SentencePiece is usually applied to the whole sentence. In that case, we suggest training SentencePiece with a pretokenization, like BPE.
You can achieve that by training the subword models with the OpenNMT Tokenizer, using its Python bindings (Tokenizer/bindings/python in the OpenNMT/Tokenizer repository on GitHub).
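As a rough illustration of why the ordering matters, the toy sketch below compares the set of word types a subword learner would see with and without a prior case markup step. It uses neither the OpenNMT Tokenizer nor SentencePiece: the helper name and the tiny corpus are made up for the example, and only simply capitalized words are handled.

```python
# Marker string copied from the tokenizer output quoted earlier in the thread.
CAPITAL_MARK = "⦅mrk_case_modifier_C⦆"


def toy_case_markup(tokens):
    """Hypothetical helper: lowercase capitalized tokens, marking each one."""
    out = []
    for tok in tokens:
        if tok[:1].isupper() and tok[1:] == tok[1:].lower():
            out.extend([CAPITAL_MARK, tok.lower()])
        else:
            out.append(tok)
    return out


# Made-up pretokenized corpus, for illustration only.
corpus = ["The", "cat", "saw", "the", "Cat"]

# Without case markup, "The"/"the" and "Cat"/"cat" are distinct types.
raw_types = set(corpus)

# With case markup applied first, the learner only sees lowercase words
# plus a single marker token.
marked_types = set(toy_case_markup(corpus)) - {CAPITAL_MARK}

print(len(raw_types), sorted(raw_types))
print(len(marked_types), sorted(marked_types))
```

In this toy corpus the number of distinct word types drops once the case variants collapse, which is the effect a subword model trained after the markup step benefits from.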
Using case_markup with SentencePiece applied on the full sentence is an open issue.
Perfect answer - thanks!
Cheers, Martin