How do I use tokenizer.perl?

Soap89 · June 8, 2019, 12:38pm

Hello guys! Thank you for all the help so far. I’ve been trying to get the tokenizer to work, but to no avail. Am I typing the command wrong? If so, what is going wrong?
This is my command line:

(base) PS>perl .\tools\tokenizer.perl -l zh -threads 4 tools\tgt-train.txt tools\output_en.tok.txt
Tokenizer Version 1.1
Language: zh
Number of threads: 4

It really just kind of ends there and does not tokenize my file at all.

park · June 8, 2019, 3:32pm

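The tokenizer reads from standard input and writes to standard output, so redirect the input and output files instead of passing them as arguments.
I also suggest passing the option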

-no-escape

Please try the following:

perl ./tools/tokenizer.perl -l zh -threads 4 -no-escape < tgt-train.txt > output_en.tok.txt
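
If the `<` redirection is a problem in your shell (PowerShell, which the `PS>` prompt in the first post suggests, reserves `<` and rejects it), the same stdin/stdout wiring can be done from Python. This is only a minimal sketch: `run_tokenizer` and its `interpreter` parameter are names made up for illustration, and it assumes `perl` is on your PATH. It uses only the flags already shown in this thread.

```python
import subprocess
from pathlib import Path


def run_tokenizer(script, lang, src, dst, threads=4, no_escape=True,
                  interpreter="perl"):
    """Pipe `src` through the tokenizer script and write the result to `dst`.

    The script reads text on standard input and prints tokenized text on
    standard output, so both files are attached to the child process
    directly instead of being passed as command-line arguments.
    """
    # Hypothetical helper: builds the same command line as in the post above.
    cmd = [interpreter, str(script), "-l", lang, "-threads", str(threads)]
    if no_escape:
        cmd.append("-no-escape")
    # Binary mode hands the bytes to the child process unchanged,
    # so the file encoding (e.g. UTF-8) is not touched by Python.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # check=True raises CalledProcessError if the tokenizer exits non-zero.
        subprocess.run(cmd, stdin=fin, stdout=fout, check=True)
    return Path(dst)
```

For the files in the question, the call would look something like `run_tokenizer("tools/tokenizer.perl", "zh", "tgt-train.txt", "output_en.tok.txt")`.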

yaren · June 8, 2019, 3:37pm

Use it like this:

./tools/tokenizer.perl -l en < /home/OpenNMT-py/zrinput.txt5 > /home/OpenNMT-py/zrinput.txt51

Zaitinkhuma · August 8, 2020, 11:40am

How is tokenization done for all the training, test, and validation data? Please help; I’m using OpenNMT-py. I would like to tokenize parallel data, i.e. English-Mizo. Your help will be really appreciated.