New `hook` mechanism


(jean.senellart) #1

Hello community!

To enable further customization of OpenNMT training, I am introducing a new feature called 'hooks', available here, to easily modify the default behaviour of some modules, add some options, or even disable others. Hooks apply seamlessly in tokenization, preprocessing, training and translation.

Here is a preview; constructive feedback is welcome :slight_smile:! Let me know if you can see any use for it, or make any dream come true with the following :wink:

Hooks are defined in a Lua file (let us call it myhook.lua) that is dynamically loaded by passing the option -hook_file myhook to the different tools.

These hook files should return a table defining some functions corresponding to hook entry points in the code.

For instance, let us consider the following hook file:

local unicode = require('tools.utils.unicode')

local function mytokenization(_, line)
  -- fancy tokenization, it has to return a table of tokens (possibly with features)
  local tokens = {}
  for v, c, _ in unicode.utf8_iter(line) do
    if unicode.isSeparator(v) then
      table.insert(tokens, '_')
    else
      table.insert(tokens, c)
    end
  end
  return tokens
end

return {
  tokenize = mytokenization
}

Save it as myhook.lua and let us try a standard tokenization:

$ echo "c'est épatant" | th tools/tokenize.lua -hook_file myhook
c ' e s t _ é p a t a n t
Tokenization completed in 0.001 seconds - 1 sentences

The custom tokenization function has taken over the default tokenization.
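For readers more comfortable with Python, the logic of the hook above can be sketched as follows (using Python's str.isspace as a stand-in for unicode.isSeparator; this is an illustration, not OpenNMT code):

```python
# Python sketch of the character tokenization performed by the Lua hook:
# every character becomes a token, separators are replaced by '_'.
def char_tokenize(line):
    tokens = []
    for ch in line:
        if ch.isspace():  # stand-in for unicode.isSeparator
            tokens.append('_')
        else:
            tokens.append(ch)
    return tokens

print(' '.join(char_tokenize("c'est épatant")))
# c ' e s t _ é p a t a n t
```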

Let us do that more cleanly. First, let us document this new wonderful option; we just need to declare it in the hook file:

local myopt =
{
  {
    '-mode', 'conservative',
    [[Define how aggressive should the tokenization be. `aggressive` only keeps sequences
      of letters/numbers, `conservative` allows a mix of alphanumeric as in: "2,000", "E65",
      "soft-landing", etc. `space` is doing space tokenization. `char` is doing character tokenization]],
    {
      enum = {'space', 'conservative', 'aggressive', 'char'}
    }
  }
}

and define a hook for the declareOpts entry point:

local function declareOptsFn(cmd)
  cmd:setCmdLineOptions(myopt, 'Tokenizer')
end

return {
  tokenize = mytokenization,
  declareOpts = declareOptsFn
}

Checking the help now shows the new option - th tools/tokenize.lua -hook_file myhook -h:

[...]
  -mode <string> (accepted: space, conservative, aggressive, char; default: conservative)
      Define how aggressive should the tokenization be. `aggressive` 
      only keeps sequences of letters/numbers, `conservative` allows 
      a mix of alphanumeric as in: "2,000", "E65", "soft-landing", 
      etc. `space` is doing space tokenization. `char` is doing character 
      tokenization
[...]
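For comparison, the option declaration above plays the same role as registering a flag with a fixed set of choices in Python's argparse (a loose analogy only; the Lua API is OpenNMT's own):

```python
import argparse

# Analogue of the myopt table: an option with an enum of accepted
# values and a default, plus help text shown by -h.
parser = argparse.ArgumentParser('Tokenizer')
parser.add_argument('-mode', default='conservative',
                    choices=['space', 'conservative', 'aggressive', 'char'],
                    help='Define how aggressive the tokenization should be.')

opt = parser.parse_args(['-mode', 'char'])
print(opt.mode)
# char
```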

We only need to handle this new mode in the tokenization function, so we now check the opt:

local function mytokenization(opt, line)
  -- fancy tokenization; it has to return a table of tokens (possibly with features)
  if opt.mode == "char" then
    local tokens = {}
    for v, c, _ in unicode.utf8_iter(line) do
      if unicode.isSeparator(v) then
        table.insert(tokens, '_')
      else
        table.insert(tokens, c)
      end
    end
    return tokens
  end
  -- for any other mode, return nil to fall back to the default tokenization
end

et voilà:

$ echo "c'est épatant" | th tools/tokenize.lua -hook_file myhook

gives the expected c ' est épatant

while

$ echo "c'est épatant" | th tools/tokenize.lua -hook_file myhook -mode char

gives the new: c ' e s t _ é p a t a n t
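The fallback behaviour seen here (the hook only handles -mode char, and the built-in tokenizer handles the rest) can be sketched in Python; the dispatch and the whitespace-splitting default below are illustrative stand-ins, not OpenNMT's actual implementation:

```python
# The hook handles only the modes it knows about and returns None
# otherwise, letting the caller fall back to the built-in tokenizer.
def my_hook(opt, line):
    if opt.get('mode') == 'char':
        return ['_' if c.isspace() else c for c in line]
    return None  # None means: use the default tokenization

def tokenize(opt, line, hook=None):
    if hook is not None:
        result = hook(opt, line)
        if result is not None:
            return result  # the hook took over
    return line.split()  # stand-in for the default tokenizer

print(tokenize({'mode': 'char'}, "c'est épatant", my_hook))
```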


(Etienne Monneret) #2

It's a kind of plugin? Is there a list of methods that can be overloaded?
I'm not sure I understand the real benefit compared with modifying the main code directly… except perhaps durability across ONMT versions…


(jean.senellart) #3

The benefit is in adding something that is not mainstream, or that connects to a third-party tool. For instance, you may want to plug a Chinese word segmentation module into the tokenization, but it could also extend to adding a personal variant of a model type, a dedicated filter in the beam search, a specific dump for visualizing internal states, etc.

Also, it simplifies the entry points in the code (look for instance at the full beam search implementation). I will make a list of the different entry points it currently covers, but also let me know if you have other ideas for it…


(Panos Kanavos) #4

Hello @jean.senellart,

So, will that make it easier to connect to a tagger and add source features before sending the text to the model for translation? If so, what would be some rough guidelines for this?

Thanks!


(jean.senellart) #5

Hi @panosk - exactly. If you have a tagger in mind, let me know and I will write an example wrapper to show how to do that.


(Panos Kanavos) #6

That would be really awesome!

I usually use TreeTagger because it has parameter files for many languages. The problem is that the default output of TreeTagger is one word per line, so I always use it with the Moses wrapper scripts, which output the tags as a sequence per line, and then combine words and tags. Please see here for most of the essential info and links: Preprocessing corpus for case_feature and POS tags
I have also used the Stanford NLP tagger a few times, and it's nice as it doesn't need a wrapper to format the output.

If I can help in any way with the taggers, please let me know.
Thanks!


(jean.senellart) #7

@panosk, I just committed a prototype hook for tree-tagger in the branch.

It contains a Python wrapper that gives tree-tagger a simple HTTP server API (to work around tree-tagger's awkward command-line interface).

To test that it can reach tree-tagger:

python hooks/tree-tagger-server.py -model ~/french.par -path ~/bin/ -sent "un test"

To launch the server:

python -u hooks/tree-tagger-server.py -model ~/french.par -path ~/bin/

=> launches on localhost port 3000

To test the server:

curl "http://localhost:3000/pos" -d "un test"

To test the hook:

th tools/tokenize.lua -hook_file hooks.tree-tagger -pos_feature < file

See the options for the Python server and the hook with:

python hooks/tree-tagger-server.py -h
th tools/tokenize.lua -hook_file hooks.tree-tagger -h
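The shape of such a wrapper can be sketched with Python's standard http.server. The /pos endpoint and port mirror the description above, but the tagger itself is faked (every token tagged 'UNK') so the sketch is self-contained, unlike the real hooks/tree-tagger-server.py which pipes tokens to tree-tagger:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def fake_tag(sentence):
    # Stand-in for the call to tree-tagger: one tag per whitespace token.
    return ' '.join('UNK' for _ in sentence.split())

class PosHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != '/pos':
            self.send_error(404)
            return
        # The sentence to tag arrives as the raw POST body.
        length = int(self.headers.get('Content-Length', 0))
        sentence = self.rfile.read(length).decode('utf-8')
        tags = fake_tag(sentence).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Length', str(len(tags)))
        self.end_headers()
        self.wfile.write(tags)

    def log_message(self, *args):
        pass  # keep the example quiet

# To serve on the port described above:
# HTTPServer(('localhost', 3000), PosHandler).serve_forever()
```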

(Panos Kanavos) #8

Hi @jean.senellart,

Thanks so much, that should be very useful!
I'll try it soon, but in the meantime, can we use the tokenization hook on the command line of another module, for instance rest_translation_server.lua, which accepts all the options of tokenize.lua?


(jean.senellart) #9

Of course, that is the main goal of this development: complete consistency between the different flows. There is just a slight change to make in each of the entry points to enable these extensions. For the moment, I did it for the preprocessing/training tools; I will do the same for the inference (translate & server) tools by the end of the week.


(Panos Kanavos) #10

That is great. Actually, I dared to modify the server a bit according to your changes and comments in preprocess.lua, and it seems to be working fine with the hook :slight_smile:.
Thanks again for all your work!


(Panos Kanavos) #11

Hello @jean.senellart,

I'm working on the tagger hook to add more options. Currently, the joiner is not handled as it should be, so I changed this in tree-tagger-server.py:

def tag(s):
  global extraneous
  l=s.split()
  for w in l:
    treetagger.stdin.write(w+'\n')
  treetagger.stdin.write('\n')
  ...

to this:

def tag(s):
  global extraneous
  joiner_char = '■'
  joiner_pos = []
  l=s.split()
  for i,w in enumerate(l):
    #if there's a joiner before the word, get rid of the joiner and store the word pos
    if w[0] == joiner_char:
      treetagger.stdin.write(w[1:]+'\n')
      joiner_pos.append(i)
    else:
      treetagger.stdin.write(w+'\n')
  treetagger.stdin.write('\n')
  ...

So, I get rid of the joiner so the tagger can correctly tag the punctuation mark, and then I restore it at the end where needed. However, it seems tokenization is applied twice (or two copies of the tokenized sentence are used), so I should only remove the joiner without restoring it later, otherwise there are extra joiners.
Could you help me understand how tokenization is applied with hooks? In this hook, for example, is tokenization applied once with two copies of the tokenized sentence, so that one is used for tagging and the other is combined with the tags and sent to the model for translation?
Thanks!
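The strip-and-restore logic discussed above can be isolated into two small helpers (a Python sketch assuming OpenNMT's default joiner marker '■'; the function names are illustrative):

```python
JOINER = '■'  # OpenNMT's default joiner marker

def strip_joiners(tokens):
    # Remove a leading joiner from each token, remembering where it was.
    stripped, positions = [], []
    for i, w in enumerate(tokens):
        if w.startswith(JOINER):
            stripped.append(w[len(JOINER):])
            positions.append(i)
        else:
            stripped.append(w)
    return stripped, positions

def restore_joiners(tokens, positions):
    # Re-attach the joiner to the tokens that originally carried one.
    pos = set(positions)
    return [JOINER + w if i in pos else w for i, w in enumerate(tokens)]

words = ['Hello', '■!']  # e.g. "Hello!" tokenized with joiner annotation
clean, pos = strip_joiners(words)
assert restore_joiners(clean, pos) == words
```

If the framework already restores the joiners on one copy of the sentence, only the stripping helper would be needed in the tagger path.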


(Panos Kanavos) #12

Please ignore the above, starting fresh this morning, I looked into tree-tagger.lua and everything is clear and obvious :slight_smile:
Anyway, once your branch gets into OpenNMT’s repo, I will send a PR with some interesting options for the tagger.


(jean.senellart) #13

Great - I am working on some documentation and plan to merge this week. Looking forward to your PR!


(jean.senellart) #14

The PR is merged - and I also added a hook with SentencePiece - details here:


(Anderleich) #15

I have just completed the first part of the tutorial to use hooks and it gave me an error. I created the file myhook.lua and typed the command:
$ echo "c'est épatant" | th tools/tokenize.lua -hook_file myhook
However, it raises the following error:
/opt/torch.git/install/bin/lua: ./onmt/utils/HookManager.lua:42: Cannot load hooks (/mnt/data/NMT/51/myhook): /opt/torch.git/install/share/lua/5.2/trepl/init.lua:389: module '/mnt/data/NMT/51/myhook' not found: No LuaRocks module found for /mnt/data/NMT/51/myhook


(Anderleich) #16

Solved! The file should be located in the OpenNMT directory. I think this should be mentioned in the documentation.