Supporting automatic tokenization/detokenization in the Translation Server

(liluhao1982) #1


I deployed a Translation Server and wrote some code so that the server can be accessed remotely, as follows (C++ translator section):

As far as I know, the input source string should be tokenized first, and the translation should be detokenized afterwards.

My question is: can tokenization of the source and detokenization of the translation be executed automatically by the Translation Server? Or do we need extra code to tokenize the source before sending it to the Translation Server, and to detokenize the returned translation?

It seems not, judging from my testing:

  1. I tokenized the source string manually - “This ’ is a testing only .”. The translation came back without any unk, but it seems NOT detokenized.

  2. I did NOT tokenize the source string - “This’s a testing only.”. The translation came back with several unk, and it seems NOT detokenized either.

Please correct me if wrong.


(srush) #2

I believe you are correct. @jean.senellart, do we now have tokenization built into the standard translation.lua?

(Vincent Nguyen) #3

That’s what I did in my PR:

(jean.senellart) #4

Hello @liluhao1982 - we provide a tokenization tool here, which is “reversible” - meaning that after translation, you do not need any external information for detokenization.

This is available as a Lua library, and we have also ported it to C++. The only constraint is that you train your engine with this tokenization. You have different options - but the simplest is just to use -joiner_annotate as a parameter of tokenize.lua.
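To illustrate why the joiner annotation makes the tokenization reversible, here is a small Python sketch of joiner-based detokenization. This is not the actual OpenNMT implementation - just a toy version, assuming the default joiner marker “￭” (U+FFED) is attached to a token on whichever side no space existed in the original text:

```python
# Toy sketch of joiner-based detokenization (NOT the OpenNMT code).
# A joiner marker on the left/right of a token means "no space on
# that side in the original text", so no external info is needed.
JOINER = "\uffed"  # default joiner marker used by tokenize.lua

def detokenize(tokens):
    out = ""
    glue_next = True  # no leading space before the first token
    for tok in tokens:
        glue_left = tok.startswith(JOINER)
        glue_right = tok.endswith(JOINER)
        tok = tok.strip(JOINER)  # drop the markers themselves
        if not (glue_left or glue_next):
            out += " "
        out += tok
        glue_next = glue_right
    return out

# "This ￭'￭ s a test ￭."  ->  "This's a test."
print(detokenize(["This", "\uffed'\uffed", "s", "a", "test", "\uffed."]))
```

The point is that every space decision is encoded in the token stream itself, which is why the translation output can be detokenized without keeping any state from the source side.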

Let us know if this works for you!

(liluhao1982) #5

Thanks for your answer.

I know the corpus should be tokenized by running tokenize.lua before running preprocess.lua / train.lua.

But when calling the deployed NMT engine through translation_server.lua / translate.lua, the input source strings must be tokenized first. Since the NMT server will be accessed remotely by our code on a Windows platform, this means we have to find a solution to tokenize the source string before sending it to the NMT server, and to detokenize the returned translation.

What I am asking is whether translation_server.lua / translate.lua could have tokenization/detokenization built in, so that they accept plain text and tokenize/detokenize automatically when the translate command is called. If so, we wouldn’t need to do tokenization/detokenization ourselves when calling the translate function.

Your language-independent tokenizer/detokenizer in OpenNMT seems great; I haven’t found a better one.

Please correct me if wrong.


(Vincent Nguyen) #6

Again, if you check the PR in my previous message, you will find a rest_translation_server.lua
which does the same thing as translation_server.lua BUT first tokenizes the input and detokenizes the output.
Plus, it uses a REST syntax.
So your Windows app can send plain text directly.

(liluhao1982) #7


I’ve downloaded your REST version from the path below:

Then I ran luarocks make rocks/opennmt-scm-1.rockspec.

I tried to start the server with rest_translation_server.lua, but errors appeared; please see more detail in the attachment.

I’ve already installed the necessary dependencies and can start the translation server with translation_server.lua without any problem.

Are there any special dependencies needed for launching rest_translation_server.lua?

Thanks again.

(Vincent Nguyen) #8

Sorry, yes - you need this:

(liluhao1982) #9

Thanks very much. Now I can launch the server by running rest_translation_server.lua.

I have another problem: how do I access this server remotely?

I can access the server launched by translation_server.lua with the following code without any problem, but when I execute the same code against the server launched by rest_translation_server.lua, there is no response from the server.

import zmq, sys, json
sock = zmq.Context().socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:5556")  # address/port of the ZMQ translation server
sock.send_string(json.dumps([{"src": " ".join(sys.argv[1:])}]))
print(sock.recv_string())

Is there any sample code for accessing the server launched by rest_translation_server.lua?


(Vincent Nguyen) #10

Look at the top of the .lua script - you will see an example with curl.

Very simple.
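For a Windows client, the curl example translates directly into a few lines of Python. This is only a sketch: the host, port (7784) and route (/translator/translate) are assumptions taken from typical defaults, so adjust them to whatever the header of your rest_translation_server.lua actually says:

```python
# Minimal REST client sketch for a rest_translation_server.lua-style
# endpoint.  Host, port and route below are ASSUMED defaults - check
# the curl example at the top of the .lua script for the real values.
import json
from urllib import request

def build_payload(src):
    # The server expects a JSON list of {"src": ...} objects,
    # mirroring the curl example.
    return json.dumps([{"src": src}]).encode("utf-8")

def translate(src, url="http://127.0.0.1:7784/translator/translate"):
    req = request.Request(
        url,
        data=build_payload(src),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    import sys
    print(translate(" ".join(sys.argv[1:])))
```

Note that, unlike the ZMQ client above, this sends plain (untokenized) text - the REST server tokenizes and detokenizes on its side.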

(liluhao1982) #11

Thanks for the hint.

It seems the server can receive the request now, but errors are thrown on the client side and no translation is returned. Any suggestions?


(Vincent Nguyen) #12

What target language are you using?
I retested it; it works fine with European languages.
This is an error at detokenization time …

(liluhao1982) #13

The language pair is en-US -> es-ES. I also tested other engines (en-US -> zh-CN, en-US -> it-IT) and got the same error.

(Vincent Nguyen) #14

OK, I know what can happen …

check the options at the beginning of the file rest_translation_server.lua

I modified the default values:
cmd:option('-mode', 'conservative', [[Define how aggressive should the tokenization be - 'aggressive' only keeps sequences of letters/numbers,
'conservative' allows mix of alphanumeric as in: '2,000', 'E65', 'soft-landing']])
cmd:option('-joiner_annotate', true, [[Include joiner annotation using 'joiner' character]])
cmd:option('-joiner', separators.joiner_marker, [[Character used to annotate joiners]])
cmd:option('-joiner_new', false, [[in joiner_annotate mode, 'joiner' is an independent token]])
cmd:option('-case_feature', true, [[Generate case feature]])

You need to make them match your model settings.
Sorry for that.

(liluhao1982) #15

Thanks very much, it works now.

Hope this can be released soon.

Have a nice day.

(liluhao1982) #16

Hi again,

I ran into another issue when running rest_translation_server.lua on another engine after incremental training, following the process in the article below:

Error detail:

translation_server.lua can launch the same engine without a problem.

Do you have any suggestion?

Thanks for your patience.

(Vincent Nguyen) #17

You seem to be missing tds.
Try luarocks install tds.

(liluhao1982) #18

Is it possible to add a new option to enable/disable detokenization of the translation when deploying the REST server?

I found that detokenization of the translation is embedded in OpenNMT by default when the translation is returned.

I am asking because we usually need the tokenized rather than the detokenized translations in our application.

Has your tokenizer/detokenizer been published as a DLL/library that can be called by other applications?
Thanks for your support.