BPE and other tokenization options with rest_translation_server.lua

icortes · December 18, 2017, 2:07pm

Hi all,

I’m doing some tests with BPE, and I’d like to put my best ‘system’ on a server as a demo. So I’m trying to use rest_translation_server.lua to serve the model and query it from a client.

I’m able to run a ‘simple’ model with the server/client architecture, but I can’t reproduce the results I obtained with BPE. With the translate.lua script I used some ‘extra’ parameters, such as -tok_src_bpe_model, -tok_tgt_bpe_model, -tok_tgt_joiner_annotate and -detokenize_output, but I can’t use these parameters with rest_translation_server.lua.

I did a small test with the -bpe_model parameter, setting it to my source BPE model path, but the results are not what I expected (so I think I’m not setting this parameter properly).

So I want to confirm whether I can start the server using these parameters (or similar ones), or whether I have to preprocess each text before translation and postprocess (detokenize) it afterwards.
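
If the second option turns out to be necessary, the postprocessing step could look roughly like the sketch below. It assumes that the target side is tokenized with joiner annotation and that the joiner is OpenNMT's default marker ￭ (U+FFED). The function name is hypothetical, and the sketch ignores other tokenizer features such as case markers.

```python
import re

# Assumption: the joiner is OpenNMT's default marker (U+FFED).
JOINER = "\uffed"

# A joiner glues the token it is attached to onto its neighbour,
# so the joiner and the space next to it are both dropped.
_JOINER_RE = re.compile(r" ?" + re.escape(JOINER) + r" ?")


def detokenize_joiner_annotated(line):
    """Merge joiner-annotated tokens back into plain text (hypothetical helper)."""
    return _JOINER_RE.sub("", line).strip()


if __name__ == "__main__":
    tokens = ["Cat" + JOINER, "alu" + JOINER, "ñ" + JOINER, "a", JOINER + "."]
    print(detokenize_joiner_annotated(" ".join(tokens)))
```

This only reverses the joiner marks; it is not a replacement for the detokenizer that translate.lua applies with -detokenize_output.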

Could you help me with it?
Thanks in advance.

icortes · December 22, 2017, 11:56am

Hi again,

I’ve done some more tests. I still get different results when running the same model with translate.lua and rest_translation_server.lua.

I realized that the input tokenization is different in the two cases.

Input text
Vivo en Cataluña.

translate.lua (input tokenization)
Viv o en Cat alu ñ a .

rest_translation_server.lua (input tokenization)
Viv o en Cataluña .

In both cases I use the same BPE model for the input text, but I suspect the problem could be related to the encoding of the text.
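
To pin down exactly where the two tokenizations diverge, a small helper like the one below can compare them token by token. It is only a sketch built on Python's standard difflib module, and the function name is made up for illustration.

```python
import difflib


def tokenization_differences(tokenized_a, tokenized_b):
    """Return (tokens_a, tokens_b) pairs for every span where the two lines differ."""
    a = tokenized_a.split()
    b = tokenized_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    from_translate = "Viv o en Cat alu ñ a ."
    from_server = "Viv o en Cataluña ."
    print(tokenization_differences(from_translate, from_server))
    # [(['Cat', 'alu', 'ñ', 'a'], ['Cataluña'])]
```

Running it over a larger file of sentence pairs would show whether only some words are left unsegmented by the server or whether the difference is systematic.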

Any clue?

Thanks in advance.

jean.senellart · December 22, 2017, 10:21pm

Hello! It is unlikely that the problem is related to encoding: the handling of encoding in Lua is pretty simple and consistent. Can you copy your exact command lines for both translate.lua and rest_translation_server.lua, and let us know the exact version you are using? Thanks

icortes · December 26, 2017, 5:14pm

Hello,

I’m running the commands like this:

th tools/rest_translation_server.lua -model $DATA_ROOT/modelj_epoch10_4.83.t7 -bpe_model $DATA_ROOT/codes_es -gpuid 1 -replace_unk 1 -port 7777

th translate.lua -model $DATA_ROOT/modelj_epoch10_4.83.t7 -tok_src_bpe_model $DATA_ROOT/codes_es -tok_tgt_bpe_model $DATA_ROOT/codes_eu -gpuid 1 -src kk.txt -tok_tgt_joiner_annotate true -replace_unk true -detokenize_output true
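
For reference, a client request to the server started with the first command might be built as in this sketch. It assumes the server is reachable on localhost, that it exposes the /translator/translate route, and that it accepts a JSON list of {"src": ...} objects, as described in the OpenNMT REST server documentation; these should be checked against the version in use. The function name is hypothetical.

```python
import json
import urllib.request


def build_translation_request(sentences, host="localhost", port=7777):
    """Build (but do not send) a POST request for the REST translation server.

    Assumptions: the host, the /translator/translate route and the
    [{"src": ...}] payload shape. The port matches the -port 7777 option above.
    """
    url = "http://{}:{}/translator/translate".format(host, port)
    payload = json.dumps([{"src": s} for s in sentences]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_translation_request(["Vivo en Cataluña."])
    print(request.full_url)
    print(request.data.decode("utf-8"))
```

Sending it would be a matter of passing the request to urllib.request.urlopen and decoding the JSON response; that step is left out here because it needs a running server.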

Thanks for the reply!

guillaumekln · January 2, 2018, 8:44am

Hi,

Some tokenization options are currently missing from rest_translation_server.lua (more precisely, the tokenization options are shared between source and target). In theory, you should also set -tok_tgt_joiner_annotate true, but that option is not available in the server as of now.

We are working on it.

icortes · January 2, 2018, 9:33am

Okay. I understand that I can’t reproduce the same setup in server mode for now.

Thank you for the reply.