When I used ONMT's tokenizer to create sample tagged data (tagging for case) and then preprocessed it, the resulting dictionaries contained tokens only (no tags).
I then ran preprocess on some data that I had recently used in a Moses factored model. The source (English) had token|stem|POS; the target (Traditional Chinese) had token|POS. The dictionaries created here included the tags, so presumably they are considered part of the tokens rather than features. Here's the command I used:
OK, I ran sed -i 's/|/│/g' *.* on those files, then re-ran preprocess as in the original post.
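For reference, here is a minimal Python sketch that does the same in-place substitution as that sed command, assuming the files are UTF-8 and that the ASCII pipe only ever appears as the factor delimiter; the script name, the file arguments, and the SEPARATOR value are just placeholders for whatever your setup actually expects.

# replace_sep.py -- hypothetical helper that swaps the ASCII pipe factor
# delimiter for another separator character, editing each file in place
import sys

SEPARATOR = "\u2502"  # the character I substituted in with sed; adjust if your build expects a different one

for path in sys.argv[1:]:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # assumes "|" only occurs as the factor delimiter, as in my data
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace("|", SEPARATOR))

Run it as python replace_sep.py file1 file2 ... — like sed -i, it overwrites the files, so keep a backup copy.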
I’m now getting an error during preprocessing with this traceback:
[03/06/17 02:34:14 INFO] Building train vocabularies...
[03/06/17 02:35:28 INFO] Created word dictionary of size 113548 (pruned from 166192)
/home/dblandan/gitstuff/torch/install/bin/luajit: ./onmt/data/Vocabulary.lua:142: attempt to index a nil value
stack traceback:
./onmt/data/Vocabulary.lua:142: in function 'init'
preprocess.lua:48: in function 'main'
preprocess.lua:115: in main chunk
[C]: in function 'dofile'
...tuff/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50