Is it possible to combine the sentencepiece tokenize hook with dataset sampling?

(David Landan) #1

Just using the sentencepiece hook in train.lua doesn’t work (because it’s a tokenize hook, and train.lua doesn’t recognize the -sentencepiece option that must accompany the hook).

It seems like tokenizer options in train.lua are denoted as -tok_*, so something like -tok_hook_file and -tok_hook_options (someplace to put “-sentencepiece foo.model”) would be very useful.

(Guillaume Klein) #2

The sentencepiece hook can already be used during training (and translation). See the options extended by the hook:

th train.lua -hook_file hooks/sentencepiece -h | grep -C 5 sentencepiece
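
If the surrounding help text is too noisy, the output of the command above can be reduced to just the option names. The following is a minimal sketch, not part of OpenNMT: `list_options` is a hypothetical helper, and it assumes the help text prints each option as a dash followed by letters, digits, or underscores.

```shell
# Hypothetical helper (not part of OpenNMT): read help text on stdin and
# print each distinct option name that contains the given keyword.
# Assumes options are printed as "-name", where name is made of letters,
# digits, and underscores.
list_options() {
  grep -o -- "-[A-Za-z0-9_]*$1[A-Za-z0-9_]*" | sort -u
}

# Usage, assuming th and the hook are available locally:
#   th train.lua -hook_file hooks/sentencepiece -h | list_options sentencepiece
```

The path `hooks/sentencepiece` is not matched because no dash precedes it directly, so only option names are listed.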

(David Landan) #3

Aha… I missed that in the online documentation. Cheers!

(Guillaume Klein) #4

Yes, the online documentation is generated without any hook loaded, so the options added by the hook do not appear there.