Hello community!
To enable further customization of OpenNMT training, I am introducing a new feature called ‘hooks’, available here, to easily modify the default behaviour of some modules, add options, or even disable others. Hooks apply seamlessly in tokenization, preprocessing, training, and translation.
Here is a preview, and constructive feedback is welcome! Let me know if you see any uses, or make any dream come true with the following…
Hooks are defined in a Lua file (let us call it `myhook.lua`) that is dynamically loaded by passing the option `-hook_file myhook` to the different tools.
These hook files should return a table defining some functions corresponding to hook entry points in the code.
For instance, let us consider the following hook file:
```lua
local unicode = require('tools.utils.unicode')

local function mytokenization(_, line)
  -- fancy tokenization; it has to return a table of tokens (possibly with features)
  local tokens = {}
  for v, c, _ in unicode.utf8_iter(line) do
    if unicode.isSeparator(v) then
      table.insert(tokens, '_')
    else
      table.insert(tokens, c)
    end
  end
  return tokens
end

return {
  tokenize = mytokenization
}
```
Save it as `myhook.lua` and let us try a standard tokenization:
```
$ echo "c'est épatant" | th tools/tokenize.lua -hook_file myhook
c ' e s t _ é p a t a n t
Tokenization completed in 0.001 seconds - 1 sentences
```
The hook function has taken over the normal tokenization.
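For context, the plumbing behind `-hook_file` can be pictured roughly as follows. This is a simplified sketch, not the actual OpenNMT code; `defaultTokenize` and the exact loading logic are assumptions:

```lua
-- Hypothetical sketch of the hook dispatch (names are assumptions).
local hooks = {}
if opt.hook_file and opt.hook_file ~= '' then
  -- the hook file is loaded as a regular Lua module
  hooks = require(opt.hook_file)
end

local function tokenize(opt, line)
  if hooks.tokenize then
    local tokens = hooks.tokenize(opt, line)
    -- a non-nil result means the hook took over
    if tokens ~= nil then
      return tokens
    end
  end
  return defaultTokenize(opt, line)  -- assumed default implementation
end
```

This also suggests the contract: a hook that returns `nil` hands control back to the default behaviour.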
Let us do that more cleanly. First, to document this wonderful new option, we just need to declare it in the hook file:
```lua
local myopt =
{
  {
    '-mode', 'conservative',
    [[Define how aggressive should the tokenization be. `aggressive` only keeps sequences
of letters/numbers, `conservative` allows a mix of alphanumeric as in: "2,000", "E65",
"soft-landing", etc. `space` is doing space tokenization. `char` is doing character tokenization]],
    {
      enum = {'space', 'conservative', 'aggressive', 'char'}
    }
  }
}
```
and define a hook for the `declareOpts` entry point:
```lua
local function declareOptsFn(cmd)
  cmd:setCmdLineOptions(myopt, 'Tokenizer')
end

return {
  tokenize = mytokenization,
  declareOpts = declareOptsFn
}
```
Checking the help now shows the new option (`th tools/tokenize.lua -hook_file myhook -h`):
```
[...]
  -mode <string> (accepted: space, conservative, aggressive, char; default: conservative)
      Define how aggressive should the tokenization be. `aggressive`
      only keeps sequences of letters/numbers, `conservative` allows
      a mix of alphanumeric as in: "2,000", "E65", "soft-landing",
      etc. `space` is doing space tokenization. `char` is doing character
      tokenization
[...]
```
We just need to handle this new mode in the tokenization function; we now check `opt` and return nothing for the other modes, so that the default tokenization applies to them:
```lua
local function mytokenization(opt, line)
  -- fancy tokenization; it has to return a table of tokens (possibly with features)
  if opt.mode == "char" then
    local tokens = {}
    for v, c, _ in unicode.utf8_iter(line) do
      if unicode.isSeparator(v) then
        table.insert(tokens, '_')
      else
        table.insert(tokens, c)
      end
    end
    return tokens
  end
  -- implicitly returns nil for the other modes, so the default tokenization applies
end
```
et voilà:
```
$ echo "c'est épatant" | th tools/tokenize.lua -hook_file myhook
```
gives the expected `c ' est épatant`, while
```
$ echo "c'est épatant" | th tools/tokenize.lua -hook_file myhook -mode char
```
gives the new: `c ' e s t _ é p a t a n t`
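As a closing aside, the character-splitting logic itself is easy to play with outside the OpenNMT tree. Here is a self-contained, byte-level (ASCII-only) approximation of what the `char` mode hook does, without the `tools.utils.unicode` helpers (real input may contain multi-byte UTF-8 characters, which the hook above handles via `unicode.utf8_iter`):

```lua
-- ASCII-only approximation of the char-mode hook, for quick experimentation.
local function char_tokenize(line)
  local tokens = {}
  for i = 1, #line do
    local c = line:sub(i, i)
    -- spaces become the '_' placeholder, as in the hook
    table.insert(tokens, c == ' ' and '_' or c)
  end
  return tokens
end

print(table.concat(char_tokenize("c'est epatant"), ' '))
-- prints: c ' e s t _ e p a t a n t
```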