As I understand it, the input source string should be tokenized first and the translation should be detokenized afterwards.
My question is: can the tokenization of the source and the detokenization of the translation be done automatically by the translation server? Or do we need extra code to tokenize the source before sending it to the translation server and to detokenize the returned translation?
It seems not, according to my testing:
When I tokenize the source string manually (“This ’ is a testing only .”), the translation is returned without unk, but it does not seem to be detokenized.
When I do NOT tokenize the source string (“This’s a testing only.”), the translation is returned with several unk, and it does not seem to be detokenized either.
This is available as a Lua library, and we have also ported it to C++. The only constraint is that you train your engine with the same tokenization. You have different options, but the simplest is just to pass -joiner_annotate as a parameter to tokenize.lua.
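To make the joiner idea concrete, here is a minimal Python sketch of why joiner annotation makes detokenization reversible. It is a simplified illustration, not the algorithm in tokenize.lua: the splitting rule (word characters versus single punctuation marks) and the function names are assumptions made for this example, and the marker is assumed to be the "￭" character.

```python
import re

JOINER = "\uffed"  # "￭", assumed here to be the joiner marker

def tokenize(text):
    """Split punctuation from words and mark where no space was present."""
    tokens = []
    for word in text.split():
        prev_is_word = None
        for piece in re.findall(r"\w+|[^\w\s]", word):
            is_word = re.fullmatch(r"\w+", piece) is not None
            if prev_is_word is None:
                tokens.append(piece)           # first piece of a space-delimited word
            elif not is_word:
                tokens.append(JOINER + piece)  # punctuation glued to what precedes it
            else:
                tokens[-1] += JOINER           # letters glued to preceding punctuation
                tokens.append(piece)
            prev_is_word = is_word
    return tokens

def detokenize(tokens):
    """Rejoin tokens, removing every space that sits next to a joiner."""
    text = " ".join(tokens)
    text = text.replace(" " + JOINER, "").replace(JOINER + " ", "")
    return text.replace(JOINER, "")

if __name__ == "__main__":
    toks = tokenize("This's a testing only.")
    print(" ".join(toks))
    print(detokenize(toks))
```

Because every removed space is recorded by a joiner, the detokenizer needs no language-specific rules. The model only has to be trained on text tokenized the same way.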
I know the corpus should be tokenized by running tokenize.lua before running preprocess.lua / train.lua.
But when calling a deployed NMT engine through translation_server.lua / translate.lua, the input source strings must be tokenized first. Since the NMT server will be accessed remotely by our code on the Windows platform, we would have to find a way to tokenize the source string before sending it to the NMT server and to detokenize the returned translation.
What I mean is: could translation_server.lua / translate.lua have built-in tokenization/detokenization, so that it accepts plain text and does the tokenization / detokenization automatically when the translate command is called? If so, we would not need to do tokenization / detokenization ourselves when calling the translate function.
Your language-independent tokenizer/detokenizer in NMT seems great; I haven’t found a better tokenizer/detokenizer than this.
Again, if you check the PR in my previous message, you will find a rest_translation_server.lua
which does the same thing as translation_server.lua, BUT it first tokenizes the input and detokenizes the output.
Plus, it uses REST syntax.
So your Windows app can send plain text directly.
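For illustration, a client could send untokenized text roughly as in the Python sketch below. The host, port, endpoint path and JSON shape (a list of objects with a "src" field) are assumptions made for this example and should be checked against your copy of rest_translation_server.lua. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed values: adjust to the host, port and route your server actually uses.
SERVER_URL = "http://127.0.0.1:7784/translator/translate"

def build_request(text, url=SERVER_URL):
    """Build a POST request carrying plain, untokenized text as JSON."""
    payload = json.dumps([{"src": text}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def translate(text, url=SERVER_URL, timeout=30):
    """Send the text and return the decoded JSON reply unchanged."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

if __name__ == "__main__":
    # Requires a running server; prints whatever structure it returns.
    print(translate("This's a testing only."))
```

The reply is returned as parsed JSON without assuming its layout, so you can inspect it once and then pick out the translated field.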
I have another problem: how do I access the server above remotely?
I can access the server launched by translation_server.lua with the following code without any problem, but when I run the same code against the server launched by rest_translation_server.lua, there is no response from the server.
Check the options at the beginning of the file rest_translation_server.lua.
I modified the default values:
cmd:option('-mode', 'conservative', [[Define how aggressive should the tokenization be - 'aggressive' only keeps sequences of letters/numbers,
'conservative' allows mix of alphanumeric as in: '2,000', 'E65', 'soft-landing']])
cmd:option('-joiner_annotate', true, [[Include joiner annotation using 'joiner' character]])
cmd:option('-joiner', separators.joiner_marker, [[Character used to annotate joiners]])
cmd:option('-joiner_new', false, [[in joiner_annotate mode, 'joiner' is an independent token]])
cmd:option('-case_feature', true, [[Generate case feature]])
You need to make them match your model settings.
Sorry about that.
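On the earlier question of getting no response from a remote machine: separately from the tokenization options, a quick first check is whether the server's port is reachable from the client at all. The Python sketch below only tests the TCP connection. The host name and port in the usage example are placeholders, not values taken from the server script.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

if __name__ == "__main__":
    # Placeholder address: replace with your server's host and port.
    print(can_connect("nmt-server.example", 7784))
```

If this returns False from the Windows machine but True on the server itself, the cause is more likely the listening address or a firewall than the translation options.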