Possible to combine the tokenize sentencepiece hook with dataset sampling?

Just using the sentencepiece hook in train.lua doesn’t work (because it’s a tokenize hook, and train.lua doesn’t recognize the -sentencepiece option that must accompany the hook).

It seems like tokenizer options in train.lua are denoted as -tok_*, so something like -tok_hook_file and -tok_hook_options (someplace to put “-sentencepiece foo.model”) would be very useful.

The sentencepiece hook can already be used during training (and translation). See the options extended by the hook:

th train.lua -hook_file hooks/sentencepiece -h | grep -C 5 sentencepiece

Aha… I missed that in the online documentation. Cheers!

Yes, the online documentation is generated without any hook loaded, so the options added by hooks do not appear there.

I’m just now getting back to trying this. In my config file, I have:

hook_file = hooks/sentencepiece
tok_src_mode = none
tok_tgt_mode = none
tok_src_joiner_annotate = true
tok_tgt_joiner_annotate = true
tok_src_sentencepiece = /data/exp03/en-ja_spm_50k.model
tok_tgt_sentencepiece = /data/exp03/en-ja_spm_50k.model

but when I run th train.lua -config /data/exp03/train.opts -preprocess_pthreads 1, I get:

train.lua: unknown option -tok_src_sentencepiece
Try 'train.lua -h' for more information, or browse the documentation in the docs/ directory.

Just to make sure, I checked the help for the hook as you suggested, and the option looks like it should work:

  -tok_{src,tgt}_sentencepiece <string> (default: '')
      Path to the model to use with sentencepiece - can be combined
      with regular tokenization mode.

I’m sure I’m missing something simple again, but I can’t figure out what it is… suggestions?

Sorry for the undocumented behavior: -hook_file must appear on the command line. So you just need to move hook_file out of the configuration file and back to the command line, for example: th train.lua -hook_file hooks/sentencepiece -config /data/exp03/train.opts -preprocess_pthreads 1

The reason is that we peek at the command line before parsing it, so that the options declared by the hook can be registered. The configuration file is currently read only after parsing, which is too late for a hook_file set there to take effect.
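
To illustrate why the ordering matters, here is a minimal sketch of the "peek first, parse later" pattern. It is written in Python with argparse purely for illustration and is not OpenNMT's actual code: the HOOK_OPTIONS table and the peek_hook_file, build_parser and parse functions are hypothetical names, and the configuration file is modeled as a plain dictionary.

```python
import argparse

# Hypothetical table of the options each hook declares (illustration only).
HOOK_OPTIONS = {
    "hooks/sentencepiece": ["-tok_src_sentencepiece", "-tok_tgt_sentencepiece"],
}


def peek_hook_file(argv):
    """Look for -hook_file in the raw command line, before any parsing."""
    for flag, value in zip(argv, argv[1:]):
        if flag == "-hook_file":
            return value
    return None


def build_parser(argv):
    """Declare the base options plus whatever the peeked hook adds."""
    parser = argparse.ArgumentParser(prog="train", add_help=False)
    parser.add_argument("-hook_file", default="")
    parser.add_argument("-config", default="")
    for option in HOOK_OPTIONS.get(peek_hook_file(argv), []):
        parser.add_argument(option, default="")
    return parser


def parse(argv, config):
    """Return (options, unknown); config entries are applied after the parser is fixed."""
    parser = build_parser(argv)  # hook options are decided here, from argv alone
    merged = list(argv)
    for key, value in config.items():
        merged += ["-" + key, value]
    return parser.parse_known_args(merged)
```

In this simplified model, calling parse with hook_file only in the config dictionary leaves -tok_src_sentencepiece in the unknown list, while passing -hook_file in argv makes the same option parse cleanly. That mirrors the "unknown option" error reported above.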
