When I used ONMT's tokenizer to create sample tagged data (tagging for case) and then preprocessed it, the resulting dictionaries contained tokens only (no tags).
I then ran preprocess on some data that I had recently used in a Moses factored model. The source (English) had token|stem|POS; the target (Traditional Chinese) had token|POS. The dictionaries created here included the tags, so presumably they are considered part of the tokens rather than features. Here's the command I used:
OK, I ran sed -i 's/|/│/g' *.* on those files, then re-ran preprocess as in the original post.
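For reference, here is a minimal Python sketch that does the same in-place substitution as that sed command, assuming the files are UTF-8 and that the ASCII pipe only ever appears as the factor delimiter; the script name, the file arguments, and the SEPARATOR value are just placeholders for whatever your setup actually expects.

# replace_sep.py -- hypothetical helper that swaps the ASCII pipe factor
# delimiter for another separator character, editing each file in place
import sys

SEPARATOR = "\u2502"  # the character I substituted in with sed; adjust if your build expects a different one

for path in sys.argv[1:]:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # assumes "|" only occurs as the factor delimiter, as in my data
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace("|", SEPARATOR))

Run it as python replace_sep.py file1 file2 ... — like sed -i, it overwrites the files, so keep a backup copy.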
I’m now getting an error during preprocessing with this traceback:
[03/06/17 02:34:14 INFO] Building train vocabularies...
[03/06/17 02:35:28 INFO] Created word dictionary of size 113548 (pruned from 166192)
/home/dblandan/gitstuff/torch/install/bin/luajit: ./onmt/data/Vocabulary.lua:142: attempt to index a nil value
stack traceback:
./onmt/data/Vocabulary.lua:142: in function 'init'
preprocess.lua:48: in function 'main'
preprocess.lua:115: in main chunk
[C]: in function 'dofile'
...tuff/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50