How to use the case_feature option properly?

Hello!

I have some questions about using the case_feature option.
I set the -case_feature and -segment_case options when I tokenized the data and trained the model.

These are my scripts:

th tools/tokenize.lua -case_feature true -segment_case true < data/en_train.txt > data/output_en.tok.txt

th train.lua -data data/data-train.t7 -save_model model -gpuid 1 2 -layers 8 -rnn_size 1000 -tok_src_case_feature true -tok_src_segment_case true  > log.txt

th tools/rest_translation_server.lua -model model_checkpoint.t7 -host xxx -port xxxx -case_feature true -segment_case true -replace_unk_tagged -gpuid 2

My first question: do I have to use the -case_feature and -segment_case options together?
What is the difference if I use only -case_feature without -segment_case?

I also got an error when sending a request to this server:

500 Internal Server Error - Error in application: tools/rest_translation_server.lua:99: unicode error in line ./tools/utils/case.lua:83: assertion failed!

Please help me fix this problem.
Thank you!

Hi,

Usually it is a good idea to also set -segment_case when using case features on the target side. It ensures that mixed-case words (e.g. WiFi) can be correctly restored. See the documentation.

Note that you should also set -joiner_annotate for the tokenization to be reversible.
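
To make the WiFi example concrete, here is a rough Python sketch of the idea. It is not the code in tools/utils/case.lua: case_feature and segment_case are hypothetical names, and the label set (L, U, C, M, N) is an assumption based on the L, C and N labels visible in this thread. Joiner markers are left out for brevity.

```python
def case_feature(token):
    """Return a one-letter case label for a token (simplified sketch)."""
    # Keep only characters that actually have distinct upper/lower forms.
    letters = [ch for ch in token if ch.lower() != ch.upper()]
    if not letters:
        return "N"  # no cased letters at all (digits, punctuation, ...)
    if all(ch.islower() for ch in letters):
        return "L"  # all lowercase
    if all(ch.isupper() for ch in letters):
        return "U"  # all uppercase
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "C"  # capitalized
    return "M"      # mixed case, cannot be rebuilt from the label alone


def segment_case(token):
    """Split a token wherever a lowercase letter is followed by an uppercase one."""
    segments, current = [], ""
    for ch in token:
        if current and current[-1].islower() and ch.isupper():
            segments.append(current)
            current = ""
        current += ch
    if current:
        segments.append(current)
    return segments


for word in ["WiFi", "hello", "USA", "3"]:
    pieces = [(seg.lower(), case_feature(seg)) for seg in segment_case(word)]
    print(word, case_feature(word), pieces)
```

Labelled as a whole, WiFi gets the mixed label M, which is not enough to rebuild the original casing from wifi. Split into Wi and Fi, each segment is simply capitalized (C), so the casing can be restored.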


Regarding the error, did you also tokenize the target text with case feature?

Additionally, you don’t have to set -tok_src_case_feature true -tok_src_segment_case true during training as your data are already tokenized.

Hi, guillaumekln!

Thank you for your reply!

Actually, I didn’t use the -case_feature option on the target side.
My source language is English and my target language is Korean, which does not use the Latin alphabet, so I used a different tokenizer suited to Korean.

Do I have to use the same tokenizer on both sides if I want to use the -case_feature option?

The REST translation server currently expects the case feature to be used on both sides.

You can still change the code and disable case feature for the detokenization:

diff --git a/tools/rest_translation_server.lua b/tools/rest_translation_server.lua
index a53284c..f7b2be1 100644
--- a/tools/rest_translation_server.lua
+++ b/tools/rest_translation_server.lua
@@ -46,6 +46,8 @@ cmd:text("")
 cmd:option('-batch_size', 64, [[Size of each parallel batch - you should not change except if low memory.]])

 local opt = cmd:parse(arg)
+local detok_opt = onmt.utils.Table.deepCopy(opt)
+detok_opt.case_feature = false

 local function translateMessage(translator, lines)
   local bpe
@@ -109,7 +111,7 @@ local function translateMessage(translator, lines)
         local srcSent = translator:buildOutput(batch[b])
         local predSent
         res, err = pcall(function()
-          predSent = tokenizer.detokenize(opt,
+          predSent = tokenizer.detokenize(detok_opt,
                                           results[b].preds[bi].words,
                                           results[b].preds[bi].features)
         end)

I applied the change you gave me.

After that change, I got a correct response only for the first request.
From the second request on, I got another error:

500 Internal Server Error - Error in application: ./onmt/utils/Features.lua:61: expected 1 source features, got 0

Here is another question about the case feature option.
What happens if I run tokenize.lua with the -case_feature option on Korean text?
Have you tried it on a language that does not use the Latin alphabet?

See the updated diff in the post above, the previous one was too naive.

It will just assign the “None” case to each token.
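
A quick way to see why: Hangul has no upper/lower distinction, so there is nothing for a case feature to record. A small Python check, as an illustration only (has_cased_letter is a hypothetical helper, not part of OpenNMT):

```python
def has_cased_letter(token):
    """True if any character in the token changes under lower() or upper()."""
    return any(ch.lower() != ch.upper() for ch in token)


for token in ["Hello", "wifi", "안녕하세요", "한국어"]:
    # Tokens without any cased letter carry no case information to annotate.
    print(token, "cased" if has_cased_letter(token) else "no case")
```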

Hi Guillaume!

Sorry to draw your attention to this old post. We have seen that you changed rest_translation_server.lua in order to disable only the casing (I think).

We have seen in another post that you were thinking of adding a feature to disable input tokenization, as someone was asking to use their own tokenization. Not sure if a hook is the “official” way to do so, but I can imagine several reasons to use other solutions.

The fact is that if we run rest_translation_server without any segmentation/casing flag:

th tools/rest_translation_server.lua -port 4031 -model /home/…och13_1.31.t7 -gpuid 1 -replace_unk

and we try to translate:

curl -v -H …-X POST -d '[{ "src" : "dels│L productes│L electrònics│L en│L la│L societat│L ■.│N" }]' http://loc

We get:

…onmt/utils/Features.lua:61: expected 1 source features, got 0

So, I hope this is not a difficult question, but is there an easy way to change the code in order to disable the input tokenization? Or any workaround besides a “hook”? The idea is to send the server the same input that the translation model itself uses.

Thanks in advance!
have a nice day
miguel canals

Hi,

Did you try starting the server with the option -mode space to enable simple token splitting?
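
With simple splitting on spaces, each token has to arrive already annotated. As a sanity check before sending a request, something like the following Python sketch can verify that every token carries the expected number of features. The separator is copied from the examples in this thread and may not be the exact character your tokenizer emits; split_tokens is a hypothetical helper, and the error text only imitates the server message.

```python
FEATURE_SEP = "│"  # assumption: copied from the examples in this thread


def split_tokens(line, expected_features=1, sep=FEATURE_SEP):
    """Split a pre-tokenized line on spaces and separate each word from its features."""
    parsed = []
    for raw in line.split():
        word, *features = raw.split(sep)
        if len(features) != expected_features:
            raise ValueError(
                f"expected {expected_features} source features, "
                f"got {len(features)} in token {raw!r}"
            )
        parsed.append((word, features))
    return parsed


annotated = f"dels{FEATURE_SEP}L productes{FEATURE_SEP}L"
print(split_tokens(annotated))

try:
    split_tokens("dels productes")  # plain text without any features
except ValueError as err:
    print("rejected:", err)
```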

Hi Guillaume

Thanks a lot for your fast answer. Maybe we are doing something wrong, but, for instance if I want to translate:

3│N ■.■│N 1│N correspon│C a│L la│L subdirecció│C general│C …

Using the “mode space”:

th tools/rest_translation_server.lua -host 127.0.0.1 -model /home/…t7
-replace_unk -mode space -gpuid 1

we can indeed send the already tokenized input as is:

curl -v -H "Content-Type: application/json" -X POST
-d '[{ "src" : "3│N ■.■│N 1│N correspon│C a│L la│L…" }]'
http://127.0.0.1:7784/translator/translate

But the result is missing the case information (please note that all words in tgt are lowercase):

[[{"tgt":"3.1 corresponde a la subdirección general…",
"pred_score":-2.3162937164307,
"n_best":1,
"src":"3│N ■.■│N 1│N correspon│C a│L…"}]]

If I run the REST translation server with its own tokenizer enabled:

th tools/rest_translation_server.lua -host 127.0.0.1 -model /home/…t7
-case_feature -replace_unk -joiner_annotate true -joiner ■ -mode aggressive -gpuid 1

using the untokenized text as input:

curl -v -H "Content-Type: application/json" -X POST
-d '[{ "src" : "3.1 Correspon a la Subdirecció General d\u0027Inter…:" }]'
http://127.0.0.1:7784/translator/translate

The result is correct (all uppercase letters are restored):

[[{"pred_score":-2.3162937164307,
"tgt":"3.1 Corresponde a la Subdirección General de Inter…:",
"src":"3│N ■.■│N 1│N correspon│C a│L la│L subdirecció…",
"n_best":1}]]

So, first question: why is the case information ignored?

Second question: even with “mode space”, the target is still detokenized. I guess this is working as designed. What happens if the target translation has more features? Will they be lost? We think it should be possible to pass text to and from the model without tokenization or detokenization on either side (humble opinion though! :blush:)

Excuse us if we have missed some option. Thanks a lot!
have a nice day
miguel canals

I suggest modifying the code to bypass the target detokenization. Replace this chunk:

by:

local predSent = translator:buildOutput(results[b].preds[bi])

Let me know if that works for you.
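
With detokenization bypassed, restoring case and merging joiners becomes the client's job. Below is a minimal Python sketch of that step, assuming the response then uses the same token-plus-feature layout as the src field in the examples above, with a single case feature per token. The separator and joiner characters are copied from this thread and may differ from what your setup emits; restore_case and detokenize are hypothetical helpers, not OpenNMT functions.

```python
FEATURE_SEP = "│"  # assumption: copied from the examples in this thread
JOINER = "■"       # assumption: matches the -joiner value used above


def restore_case(word, label):
    """Apply a case label to a lowercased word."""
    if label == "U":
        return word.upper()
    if label == "C":
        return word[:1].upper() + word[1:]
    return word  # L, N, or anything unknown: leave the word unchanged


def detokenize(line, sep=FEATURE_SEP, joiner=JOINER):
    """Rebuild plain text from space-separated tokens carrying one case feature."""
    out = ""
    glue_next = False  # True when the previous token ended with a joiner
    for raw in line.split():
        word, _, label = raw.partition(sep)
        glue_left = word.startswith(joiner)
        if glue_left:
            word = word[len(joiner):]
        glue_right = word.endswith(joiner)
        if glue_right:
            word = word[:-len(joiner)]
        word = restore_case(word, label)
        if out and not (glue_left or glue_next):
            out += " "
        out += word
        glue_next = glue_right
    return out


tokens = ["3", JOINER + "." + JOINER, "1", "correspon", "a", "la"]
labels = ["N", "N", "N", "C", "L", "L"]
line = " ".join(t + FEATURE_SEP + l for t, l in zip(tokens, labels))
print(detokenize(line))
```

Mixed-case labels cannot be restored this way, which is one more reason to tokenize with -segment_case.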

Yes!!! :clap: :+1: Looks fine! Thanks a lot
have a nice day!
miguel canals