BPE and other tokenization options with rest_translation_server.lua

Hi all,

I’m doing some tests with BPE, and I’d like to put my best ‘system’ on a server as a demo. So I’m trying to use rest_translation_server.lua to run the model and query it from a client.

I’m able to run a ‘simple’ model with the server/client architecture, but I can’t reproduce the results I obtained with BPE. With the translate.lua script I used some ‘extra’ parameters, such as -tok_src_bpe_model, -tok_tgt_bpe_model, -tok_tgt_joiner_annotate and -detokenize_output, but I can’t use these parameters with rest_translation_server.lua.

I’ve done a little test with the -bpe_model parameter, setting it to my source BPE model path, but the results are not what I expected (so I think I’m not setting this parameter properly).

So, I want to confirm whether I can start the server using these parameters (or similar ones), or whether I have to preprocess each text before translation and postprocess (detokenize) it afterwards.
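
If client-side preprocessing does turn out to be necessary, the segmentation step can be sketched as follows. This is a minimal illustration in Python of how an ordered list of BPE merges is applied to a single word. The function name apply_bpe and the merge list TOY_MERGES are hypothetical and are not taken from the real codes_es file; end-of-word markers and case handling are left out. In practice the toolkit's own tokenizer with the real codes file should be used, so that training and serving see the same segmentation.

```python
def apply_bpe(word, merges):
    """Segment one word with an ordered list of BPE merges (earlier = higher priority)."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) merge rank.
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges for illustration only, NOT the contents of codes_es.
TOY_MERGES = [("C", "a"), ("Ca", "t"), ("a", "l"), ("al", "u"), ("V", "i"), ("Vi", "v")]

print(apply_bpe("Catalu\u00f1a", TOY_MERGES))  # ['Cat', 'alu', 'ñ', 'a']
```

A word for which no merge applies (for example "en" with the toy list above) comes back as single characters, which shows how strongly the result depends on the codes file that is actually loaded.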

Could you help me with it?
Thanks in advance.

Hi again,

I’ve done some more tests. I still get different results when running the same model with translate.lua and with rest_translation_server.lua.

I realized that the tokenization differs between the two cases.

Input text
Vivo en Cataluña.

translate.lua (input tokenization)
Viv o en Cat alu ñ a .

rest_translation_server.lua (input tokenization)
Viv o en Cataluña .

In both cases I use the same BPE model for the input text, but I suspect the problem could be related to the encoding of the text.
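
One quick way to test the encoding hypothesis is to compare the code points of the text sent to each script. The sketch below, in Python, only illustrates the check: it shows that "ñ" can be stored either as one code point or as "n" plus a combining tilde, and that the two forms differ as strings until they are normalized. The helper names are hypothetical, and nothing here confirms that this is what happens in the server.

```python
import unicodedata


def codepoints(text):
    """List the Unicode code points of a string, e.g. ['U+00F1']."""
    return [f"U+{ord(ch):04X}" for ch in text]


def same_after_nfc(a, b):
    """True if two strings are identical once both are NFC-normalized."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "Catalu\u00f1a"      # ñ as a single precomposed code point
decomposed = "Catalun\u0303a"   # n followed by a combining tilde

print(codepoints(composed))
print(codepoints(decomposed))
print(composed == decomposed, same_after_nfc(composed, decomposed))
```

If both inputs show the same code points, encoding can be ruled out and the difference has to come from the tokenization options.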

Any clue?

Thanks in advance.

Hello! It is unlikely that the problem is related to encoding - the handling of encoding in Lua is pretty simple and consistent. Can you copy your exact command lines for both translate.lua and rest_translation_server.lua and let us know the exact version you are using? Thanks

Hello,

I’m running the commands like this:

th tools/rest_translation_server.lua -model $DATA_ROOT/modelj_epoch10_4.83.t7 -bpe_model $DATA_ROOT/codes_es -gpuid 1 -replace_unk 1 -port 7777

th translate.lua -model $DATA_ROOT/modelj_epoch10_4.83.t7 -tok_src_bpe_model $DATA_ROOT/codes_es -tok_tgt_bpe_model $DATA_ROOT/codes_eu -gpuid 1 -src kk.txt -tok_tgt_joiner_annotate true -replace_unk true -detokenize_output true

I executed both commands on the same machine, and the Lua version is:
$ lua -v
Lua 5.2.3 Copyright © 1994-2013 Lua.org, PUC-Rio

Thanks for the reply!

Hi,

There are some tokenization options that are currently missing from rest_translation_server.lua (more precisely, the tokenization options are shared between source and target). In theory, you should also set -tok_tgt_joiner_annotate true, but that option is missing as of now.

We are working on it.
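
To illustrate why the joiner annotation matters for the output side: without joiner markers, subword tokens such as "Cat alu ñ a ." cannot be glued back together unambiguously. The Python sketch below assumes the joiner marker is "￭" (U+FFED) attached to the side of a token that should be glued to its neighbour; the function name detokenize and the sample token string are illustrative and are not real server output.

```python
JOINER = "\uffed"  # '￭', assumed joiner marker


def detokenize(tokens, joiner=JOINER):
    """Rebuild text from joiner-annotated tokens.

    A joiner at the start of a token glues it to the previous token;
    a joiner at the end glues it to the next one.
    """
    text = ""
    glue = True  # never put a space before the first token
    for tok in tokens:
        left = tok.startswith(joiner)
        right = tok.endswith(joiner)
        core = tok[len(joiner):] if left else tok
        if right and core:
            core = core[:-len(joiner)]
        if not (glue or left):
            text += " "
        text += core
        glue = right
    return text


# Illustrative annotated tokens, not actual model output.
sample = "Viv\uffed o en Cat\uffed alu\uffed \u00f1\uffed a \uffed.".split()
print(detokenize(sample))  # Vivo en Cataluña.
```

Until the server exposes the target-side option, a postprocessing step of this kind would have to run in the client, and it only works if the model output actually carries the joiner markers.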

Okay. I understand that I can’t do the same thing in server mode for now.

Thank you for the reply.