How to use the case_feature option properly?



I have some questions about using the -case_feature option.
I used the -case_feature and -segment_case options when I tokenized and trained my data.

These are my scripts:

th tools/tokenize.lua -case_feature true -segment_case true < data/en_train.txt > data/output_en.tok.txt

th train.lua -data data/data-train.t7 -save_model model -gpuid 1 2 -layers 8 -rnn_size 1000 -tok_src_case_feature true -tok_src_segment_case true  > log.txt

th tools/rest_translation_server.lua -model model_checkpoint.t7 -host xxx -port xxxx -case_feature true -segment_case true -replace_unk_tagged -gpuid 2

My first question is: do I have to set the -case_feature and -segment_case options together?
What is the difference if I set only -case_feature without -segment_case?

I also got an error when sending a request to this server.

500 Internal Server Error - Error in application: tools/rest_translation_server.lua:99: unicode error in line ./tools/utils/case.lua:83: assertion failed!

Please help me fix this problem.
Thank you!

(Guillaume Klein) #2


Usually it is a good idea to also set -segment_case when using case features on the target side. It ensures that mixed-case words (e.g. WiFi) can be correctly restored. See the documentation.

Note that you should also set -joiner_annotate for the tokenization to be reversible.
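To make the mechanism concrete, here is a minimal Python sketch of what case-feature annotation conceptually does (illustrative only, not OpenNMT's actual implementation; the "￨" feature separator and the N/L/U/C/M labels reflect how OpenNMT's tokenizer represents case features, to the best of my knowledge):

```python
# Illustrative sketch of case-feature annotation (not OpenNMT code).
# Each token is lowercased and tagged with a case label:
# C = capitalized, U = uppercase, L = lowercase, M = mixed, N = no case info.
def case_label(token):
    cased = [c for c in token if c.lower() != c.upper()]  # characters that have case
    if not cased:
        return "N"
    if all(c.isupper() for c in cased):
        return "U" if len(cased) > 1 else "C"
    if token[0].isupper() and all(c.islower() for c in cased[1:]):
        return "C"
    if all(c.islower() for c in cased):
        return "L"
    return "M"

def annotate(tokens):
    # OpenNMT attaches features to tokens with the "￨" separator.
    return ["{}￨{}".format(t.lower(), case_label(t)) for t in tokens]

print(annotate(["Hello", "WORLD", "WiFi", "123"]))
# -> ['hello￨C', 'world￨U', 'wifi￨M', '123￨N']
# Without -segment_case, "WiFi" gets the ambiguous "M" (mixed) label;
# with -segment_case, the tokenizer would split it into single-case pieces,
# so the original casing can be restored exactly at detokenization.
```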

Regarding the error, did you also tokenize the target text with case feature?

Additionally, you don’t have to set -tok_src_case_feature true -tok_src_segment_case true during training as your data are already tokenized.


Hi, guillaumekln!

Thank you for the reply!

Actually, I didn’t use the -case_feature option on the target side.
My source language is English and my target language is Korean, which does not use the Latin alphabet, so I used a different tokenizer suited to Korean.

Do I have to use the same tokenizer on both sides if I want to use the -case_feature option?

(Guillaume Klein) #4

The REST translation server currently expects the case feature to be used on both sides.

You can still change the code and disable case feature for the detokenization:

diff --git a/tools/rest_translation_server.lua b/tools/rest_translation_server.lua
index a53284c..f7b2be1 100644
--- a/tools/rest_translation_server.lua
+++ b/tools/rest_translation_server.lua
@@ -46,6 +46,8 @@ cmd:text("")
 cmd:option('-batch_size', 64, [[Size of each parallel batch - you should not change except if low memory.]])

 local opt = cmd:parse(arg)
+local detok_opt = onmt.utils.Table.deepCopy(opt)
+detok_opt.case_feature = false

 local function translateMessage(translator, lines)
   local bpe
@@ -109,7 +111,7 @@ local function translateMessage(translator, lines)
         local srcSent = translator:buildOutput(batch[b])
         local predSent
         res, err = pcall(function()
-          predSent = tokenizer.detokenize(opt,
+          predSent = tokenizer.detokenize(detok_opt,


I applied the change you suggested.

After changing the code, I got a correct response only for the first request.
From the second request on, I got another error, shown below.

500 Internal Server Error - Error in application: ./onmt/utils/Features.lua:61: expected 1 source features, got 0
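This error means the source sentence reaching the model carried no case feature. Conceptually, the failing check looks like this (a hypothetical sketch, not the actual Features.lua code; the "￨" separator is an assumption): a model trained with one source feature expects exactly one feature attached to every incoming token.

```python
# Hypothetical sketch of the feature-count check (not actual Features.lua code).
# Annotated tokens look like "word￨C"; a model trained with a case feature
# expects exactly one feature per source token.
def check_source_features(tokens, expected=1):
    for tok in tokens:
        got = tok.count("￨")  # assumed feature separator
        if got != expected:
            raise ValueError(
                "expected %d source features, got %d" % (expected, got))

check_source_features(["hello￨C", "world￨L"])   # passes
# check_source_features(["hello", "world"])      # would raise: got 0
```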

Here is another question about the case feature option.
What happens if I run tokenize.lua with the -case_feature option on Korean?
Have you tried it with another language that does not share the Latin alphabet?

(Guillaume Klein) #6

See the updated diff in the post above; the previous one was too naive.

It will just assign the “None” case to each token.
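In other words, since Hangul has no upper/lower distinction, every Korean token would carry the "N" (none) label and the feature adds no information. A quick illustrative check (plain Python, not OpenNMT code):

```python
# Caseless scripts such as Hangul carry no case information.
def has_case_info(token):
    # A character is "cased" if lowercasing and uppercasing give different results.
    return any(c.lower() != c.upper() for c in token)

for tok in ["안녕하세요", "Hello", "123"]:
    print(tok, "->", "cased" if has_case_info(tok) else "N (none)")
```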