BPE and other tokenization options with rest_translation_server.lua


(Itziar) #1

Hi all,

I’m doing some tests with BPE, and I’d like to put my best ‘system’ on a server, as a demo. So I’m trying to use rest_translation_server.lua to run the model and query it from a client.

I’m able to run a ‘simple’ model with the server/client architecture, but I can’t reproduce the results I obtained with BPE. With the translate.lua script I used some ‘extra’ parameters, such as -tok_src_bpe_model, -tok_tgt_bpe_model, -tok_tgt_joiner_annotate and -detokenize_output, but I can’t use these parameters with rest_translation_server.lua.

I’ve run a small test with the -bpe_model parameter, setting it to my source BPE model path, but the results are not what I expected (so I think I’m not setting this parameter properly).

So I’d like to confirm: is it true that I can’t start the server using these parameters (or similar ones), and do I instead have to preprocess each text before translation and postprocess (detokenize) it afterwards?

Could you help me with it?
Thanks in advance.
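
For readers considering the client-side route described above, here is a minimal Python sketch of a client that preprocesses each sentence before sending it and postprocesses the reply. It is only an illustration under stated assumptions: the /translator/translate path, the JSON request shape ([{"src": ...}]) and the response shape (a list of lists of objects with a "tgt" field) are assumed from how the OpenNMT REST server is usually queried, the host is assumed to be localhost, and the preprocess/postprocess hooks are hypothetical placeholders for whatever BPE and detokenization steps you apply yourself. Port 7777 matches the command used later in this thread.

```python
import json
import urllib.request


def build_request(sentences, host="localhost", port=7777):
    """Build the POST request for a batch of source sentences.

    Assumption: the server accepts a JSON list of {"src": ...} objects
    at /translator/translate.
    """
    url = "http://{}:{}/translator/translate".format(host, port)
    body = json.dumps([{"src": s} for s in sentences]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(sentences, preprocess=lambda s: s, postprocess=lambda s: s, **kwargs):
    """Preprocess, send to the server, and postprocess the best hypothesis.

    preprocess/postprocess are hypothetical hooks (identity by default).
    Assumption: the reply is a list with one list of hypotheses per
    sentence, each hypothesis carrying a "tgt" field.
    """
    request = build_request([preprocess(s) for s in sentences], **kwargs)
    with urllib.request.urlopen(request) as response:
        results = json.loads(response.read().decode("utf-8"))
    return [postprocess(hyps[0]["tgt"]) for hyps in results]
```

With a server running, translate(["Vivo en Cataluña."]) would send one sentence and return the list of postprocessed translations.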


(Itziar) #2

Hi again,

I’ve run some more tests. I still get different results when running the same model with translate.lua and with rest_translation_server.lua.

I realized that the tokenization is different in both cases.

Input text
Vivo en Cataluña.

translate.lua (input tokenization)
Viv o en Cat alu ñ a .

rest_translation_server.lua (input tokenization)
Viv o en Cataluña .

In both cases I use the same BPE model for the input text, but I think the problem could be related to the encoding of the text.

Any clue?

Thanks in advance.


(jean.senellart) #3

Hello! It is unlikely that the problem is related to encoding: the handling of encoding in Lua is pretty simple and consistent. Can you copy your exact command lines for both translate.lua and rest_translation_server.lua, and let us know the exact version you are using? Thanks.


(Itziar) #4

Hello,

I’m running the commands like this:

th tools/rest_translation_server.lua -model $DATA_ROOT/modelj_epoch10_4.83.t7 -bpe_model $DATA_ROOT/codes_es -gpuid 1 -replace_unk 1 -port 7777

th translate.lua -model $DATA_ROOT/modelj_epoch10_4.83.t7 -tok_src_bpe_model $DATA_ROOT/codes_es -tok_tgt_bpe_model $DATA_ROOT/codes_eu -gpuid 1 -src kk.txt -tok_tgt_joiner_annotate true -replace_unk true -detokenize_output true

I ran both commands on the same machine, and the Lua version is:
$ lua -v
Lua 5.2.3 Copyright © 1994-2013 Lua.org, PUC-Rio

Thanks for the reply!


(Guillaume Klein) #5

Hi,

Some tokenization options are currently missing from rest_translation_server.lua (more precisely, the tokenization options are shared between source and target). In theory, you should also set -tok_tgt_joiner_annotate true, but that option is missing as of now.

We are working on it.
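
To illustrate why the joiner annotation matters for detokenization, here is a minimal Python sketch, not the OpenNMT implementation. It assumes the joiner marker is the character ￭ (U+FFED), attached to the side of a token that should be glued to its neighbour. The annotated tokens in the usage note are a hypothetical rendering of the segmentation shown earlier in this thread.

```python
JOINER = "\uffed"  # "￭", assumed joiner marker


def detokenize(tokens, joiner=JOINER):
    """Rebuild text from tokens annotated with a joiner marker.

    A token starting with the joiner is glued to the previous token;
    a token ending with the joiner is glued to the next one.
    Tokens without a marker are separated by a single space.
    """
    pieces = []
    glue_next = False  # True when the previous token ended with the joiner
    for token in tokens:
        glue_left = token.startswith(joiner)
        glue_right = token.endswith(joiner)
        core = token.strip(joiner)  # drop the marker(s), keep the text
        if pieces and not (glue_left or glue_next):
            pieces.append(" ")
        pieces.append(core)
        glue_next = glue_right
    return "".join(pieces)
```

For example, detokenize("Viv ￭o en Cat ￭alu ￭ñ ￭a ￭.".split()) gives "Vivo en Cataluña.", whereas the same tokens without markers ("Viv o en Cat alu ñ a .") can only be joined with spaces, so the original word boundaries are lost.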


(Itziar) #6

Okay. I understand that I can’t run the same configuration in server mode for now.

Thank you for the reply.