Automatic tokenization/detokenization by the Translation Server

liluhao1982 · February 20, 2017, 2:27am

Hello,

I deployed a Translation Server and wrote some code so that it can be accessed remotely, following the C++ translator section here:

http://opennmt.net/Advanced/#translation

As far as I know, the source string must be tokenized before translation, and the translation must be detokenized afterwards.

My question is: can the Translation Server tokenize the source and detokenize the translation automatically? Or do we need extra code to tokenize the source before sending it to the Translation Server and to detokenize the returned translation?

According to my testing, it seems not:

When I tokenized the source string manually (“This ’ is a testing only .”), the translation was returned without unk tokens, but it was NOT detokenized.
When I did NOT tokenize the source string (“This’s a testing only.”), the translation was returned with several unk tokens, and it was also NOT detokenized.

Please correct me if I am wrong.

Thanks.

srush · February 20, 2017, 2:51am

I believe you are correct. @jean.senellart do we now have tokenization built into the standard translation.lua?

vince62s · February 20, 2017, 6:29am

That’s what I did in my PR: https://github.com/OpenNMT/OpenNMT/pull/126

jean.senellart · February 20, 2017, 6:31am

Hello @liluhao1982 - we provide a tokenization tool here (https://github.com/OpenNMT/OpenNMT/tree/master/tools) which is “reversible”, meaning that after translation you do not need any external information to detokenize.

It is available as a Lua library, and we have also ported it to C++. The only constraint is that you train your engine with this tokenization. There are different options, but the simplest is to pass -joiner_annotate to tokenize.lua.

Let us know if this works for you!
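
To illustrate what “reversible” means here, the following is a minimal Python sketch of joiner-based detokenization. It is not OpenNMT's implementation. It assumes the joiner marker is the character “￭” (U+FFED) and that a token carrying the marker on one side is glued to its neighbour on that side; the function name detokenize is hypothetical.

```python
JOINER = "\uffed"  # assumed joiner marker ("￭"); adjust to match your tokenizer


def detokenize(tokenized, joiner=JOINER):
    """Rebuild plain text from a joiner-annotated, space-separated token string."""
    out = []
    join_next = False  # True when the previous token asked to be glued to the next one
    for tok in tokenized.split():
        if tok == joiner:
            # Standalone joiner token: glue the neighbours on both sides.
            join_next = True
            continue
        join_left = tok.startswith(joiner)
        if join_left:
            tok = tok[len(joiner):]
        join_right = tok.endswith(joiner)
        if join_right:
            tok = tok[:-len(joiner)]
        if out and (join_left or join_next):
            out[-1] += tok  # attach to the previous token without a space
        else:
            out.append(tok)
        join_next = join_right
    return " ".join(out)
```

Because the markers record where spaces were removed, the only information needed to restore the original spacing is the token sequence itself.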

liluhao1982 · February 20, 2017, 7:03am

Thanks for your answer.

I know the corpus should be tokenized with tokenize.lua before running preprocess.lua / train.lua.

But when calling a deployed NMT engine through translation_server.lua / translate.lua, the input source strings must be tokenized first. Since our code will access the NMT server remotely from the Windows platform, we would have to find a way to tokenize the source string before sending it to the NMT server, and to detokenize the returned translation.

What I am asking is whether translation_server.lua / translate.lua could have tokenization/detokenization built in, so that they accept plain text and tokenize/detokenize automatically when the translate command is called. If so, we would not need to do tokenization/detokenization ourselves when calling the translate function.

Your language-independent tokenizer/detokenizer in NMT seems great; I have not found a better one.

Please correct me if I am wrong.

Thanks.

vince62s · February 20, 2017, 8:07am

Again, if you check the PR in my previous message, you will find a rest_translation_server.lua
which does the same thing as translation_server.lua, BUT it first tokenizes the input and detokenizes the output.
Plus, it uses a REST syntax.
So your Windows app can send plain text directly.

liluhao1982 · February 20, 2017, 9:23am

Thanks.

I’ve downloaded your REST version from the path below:

https://github.com/vince62s/OpenNMT/tree/b2d99156a5bcc13aab4a4ef2274d79bac431511e

Then I ran luarocks make rocks/opennmt-scm-1.rockspec

I tried to start the server with rest_translation_server.lua, but errors appeared; please see the attachment for more detail.

I’ve already installed the necessary dependencies, and I can start the translation server with translation_server.lua without problems.

Are there any special dependencies needed for launching rest_translation_server.lua?

Thanks again.

vince62s · February 20, 2017, 9:29am

Sorry, yes, you need this:
https://github.com/hishamhm/restserver

liluhao1982 · February 20, 2017, 10:30am

Thanks very much. Now I can launch the server by running rest_translation_server.lua.

I have another problem: how do I access this server remotely?

With the following code I can access the server launched by translation_server.lua without problems, but when I run the same code against the server launched by rest_translation_server.lua, there is no response from the server.

http://opennmt.net/Advanced/#c-translator

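import zmq, sys, json
# Open a ZeroMQ request socket to the server started by translation_server.lua
sock = zmq.Context().socket(zmq.REQ)
sock.connect("tcp://127.0.0.1:5556")
# Send the command-line words as a single source sentence, encoded as JSON
sock.send_string(json.dumps([{"src": " ".join(sys.argv[1:])}]))
# The reply arrives as bytes; decode it before printing
print(sock.recv().decode("utf-8"))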

Is there any sample code for accessing the server launched by rest_translation_server.lua?

Thanks.

vince62s · February 20, 2017, 10:36am

Look at the top of the .lua script.

You will see an example with curl.

Very simple.
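
For readers who prefer Python to curl, here is a minimal sketch of the same request using only the standard library. The URL and the JSON payload shape are copied from the curl example in the header of rest_translation_server.lua; the function names build_request and translate are hypothetical, and the response body is returned as raw text because its exact format is not described in this thread.

```python
import json
import urllib.request

# Endpoint taken from the curl example in the script header; change host and
# port to match the options the server was started with.
DEFAULT_URL = "http://127.0.0.1:7784/translator/translate"


def build_request(text, url=DEFAULT_URL):
    """Build the POST request carrying one plain-text source sentence."""
    payload = json.dumps([{"src": text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, url=DEFAULT_URL, timeout=30):
    """Send the request and return the server's response body as text."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the REST server running, translate("international migration") would send the same request as the curl command. Unlike the ZeroMQ client, this talks plain HTTP, which is why the ZeroMQ code gets no response from the REST server.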

liluhao1982 · February 20, 2017, 1:45pm

Thanks for the hint.

It seems the server receives the request now, but an error is thrown on the client side and no translation is returned. Any suggestions?

Thanks.

vince62s · February 20, 2017, 2:21pm

What target language are you using?
I retested it; it works fine with European languages.
This is an error at detokenization time …

liluhao1982 · February 20, 2017, 2:38pm

The language pair is en-US -> es-ES. I also tested other engines (en-US -> zh-CN, en-US -> it-IT) and got the same error.

vince62s · February 20, 2017, 2:50pm

OK, I know what may have happened …

Check the options at the beginning of the file rest_translation_server.lua

I modified the default values:
cmd:option('-mode', 'conservative', [[Define how aggressive should the tokenization be - 'aggressive' only keeps sequences of letters/numbers,
'conservative' allows mix of alphanumeric as in: '2,000', 'E65', 'soft-landing']])
cmd:option('-joiner_annotate', true, [[Include joiner annotation using 'joiner' character]])
cmd:option('-joiner', separators.joiner_marker, [[Character used to annotate joiners]])
cmd:option('-joiner_new', false, [[in joiner_annotate mode, 'joiner' is an independent token]])
cmd:option('-case_feature', true, [[Generate case feature]])

You need to make them match your model's settings.
Sorry for that.
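
One way to spot such a mismatch before starting the server is to compare the two sets of tokenization options side by side. The sketch below is only an illustration: the server values are the defaults quoted in this post, while training_settings is a hypothetical example of a model trained without the case feature, and mismatched_options is a made-up helper, not part of OpenNMT.

```python
def mismatched_options(training, server):
    """Return {name: (training_value, server_value)} for options that differ."""
    names = sorted(set(training) | set(server))
    return {
        name: (training.get(name), server.get(name))
        for name in names
        if training.get(name) != server.get(name)
    }


# Server defaults as quoted in this post.
server_defaults = {
    "mode": "conservative",
    "joiner_annotate": True,
    "joiner_new": False,
    "case_feature": True,
}

# Hypothetical settings used when the corpus was tokenized for training.
training_settings = {
    "mode": "conservative",
    "joiner_annotate": True,
    "joiner_new": False,
    "case_feature": False,
}

for name, (trained, served) in mismatched_options(training_settings, server_defaults).items():
    print(f"-{name}: model was trained with {trained!r}, server uses {served!r}")
```

Any option reported here would need to be changed on the server side so that it matches what the model saw during training.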

liluhao1982 · February 20, 2017, 3:15pm

Thanks very much, it works now.

I hope this can be released soon.

Have a nice day.

liluhao1982 · February 21, 2017, 1:57am

Hi again,

I ran into another issue when running rest_translation_server.lua on another engine, after incremental training following the process described in the issue below:

https://github.com/OpenNMT/OpenNMT/issues/72

Error detail:

translation_server.lua can launch the same engine without problems.

Do you have any suggestions?

Thanks for your patience.

vince62s · February 21, 2017, 6:35am

You seem to be missing tds.
Try: luarocks install tds

liluhao1982 · September 11, 2017, 10:07am

Hi,
Would it be possible to add a new option to enable/disable detokenization of the translation when deploying the REST server?

I found that detokenization of the translation is built into OpenNMT by default when the translation is returned.

https://github.com/OpenNMT/OpenNMT/blob/master/tools/rest_translation_server.lua

#!/usr/bin/env lua
--[[
  This requires the restserver-xavante rock to run.
  run server (this file)
  th tools/rest_translation_server.lua -model ../Recipes/baseline-1M-enfr/exp/model-baseline-1M-enfr_epoch13_3.44.t7 -gpuid 1
  query the server:
  curl -v -H "Content-Type: application/json" -X POST -d '[{ "src" : "international migration" }]' http://127.0.0.1:7784/translator/translate
]]

require('onmt.init')
local tokenizer = require('tools.utils.tokenizer')
local BPE = require ('tools.utils.BPE')
local restserver = require("tools.restserver.restserver")

local cmd = onmt.utils.ExtendedCmdLine.new('rest_translation_server.lua')

local options = {
  {
    '-host', '127.0.0.1',
    [[Host to run the server on.]]

This file has been truncated.

I ask because our application usually needs the tokenized translation rather than the detokenized one.

Has your tokenizer/detokenizer been published as a DLL/library that other applications can call?
Thanks for your support.