Preprocessing tagged data dictionary problem

When I used ONMT’s tokenizer to create sample tagged data (tagging for case) and then preprocessed it, the resulting dictionaries contained tokens only (no tags).

I then ran preprocess on some data that I had recently used in a Moses factored model. Source (English) had token|stem|POS; target (Traditional Chinese) had token|POS. The dictionaries created here included the tags, so presumably they are being treated as part of the tokens rather than as features. Here’s the command I used:

th preprocess.lua -train_src train_clean_factored.en -train_tgt train_clean_factored.zh-tw -valid_src tune_clean_factored.en -valid_tgt tune_clean_factored.zh-tw -src_words_min_frequency 2 -tgt_words_min_frequency 2 -src_seq_length 60 -tgt_seq_length 90 -save_data factored-min_freq2-srclen60-tgtlen90

Heads of the .dict files (source, then target) are below:

<blank> 1
<unk> 2
<s> 3
</s> 4
the|the|DT 5
.|.|. 6
,|,|, 7
to|to|TO 8
a|a|DT 9
{1}|{1}|CD 10

<blank> 1
<unk> 2
<s> 3
</s> 4
。|PU 5
,|PU 6
的|DEC 7
「|PU 8
」|PU 9
的|DEG 10

Sample sentences from the training corpus:

app|app|NN startup|startup|NN and|and|CC transition|transition|NN times|time|NNS improved|improve|VBN in|in|IN general|general|JJ

應|AD 用|P 程|NN 式|NN 的|DEG 啟|NN 動|NN 與|NN 轉|NN 換|NN 時|NN 間|NN 總|NN 體|NN 上|LC 有所|VV 改|VV 進|NN

How can I get the files to parse correctly?

Just to confirm, are you following this guide?

“To use additional features, directly modify your data by appending labels to each word with the special character │ (unicode character FFE8).”

The feature marker looks like a pipe, but it is a different Unicode character.
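
To make the difference concrete, here is a minimal sketch of splitting a token on U+FFE8 only. The helper split_token is hypothetical and written for illustration; it is not OpenNMT's actual parsing code. Both characters are written as escapes so that look-alike glyphs cannot be confused.

```python
# Sketch only: illustrates why an ASCII pipe is not seen as a feature separator.
FEATURE_SEP = "\uFFE8"  # the separator named in the quoted guide (U+FFE8)
ASCII_PIPE = "\u007C"   # the ordinary "|" typed from the keyboard


def split_token(token):
    """Split one token into (word, [features]) on U+FFE8 only (hypothetical helper)."""
    word, *features = token.split(FEATURE_SEP)
    return word, features


tagged_ok = FEATURE_SEP.join(["the", "the", "DT"])
tagged_ascii = ASCII_PIPE.join(["the", "the", "DT"])

print(split_token(tagged_ok))     # word plus two features
print(split_token(tagged_ascii))  # whole string stays one word, no features
```

With the ASCII pipe the whole string remains a single word, which matches dictionary entries such as the|the|DT in the original post.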


D’oh! Thanks, @srush… User error. Easily remedied by sed. :flushed:

OK, I ran sed -i 's/|/│/g' *.* on those files, then re-ran preprocess as in the original post.
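
For anyone repeating this step, a rough Python stand-in for that sed replacement is sketched below. It writes the separator as an escape, so there is no doubt about which vertical-bar look-alike ends up in the data. The helper names convert_line and feature_counts are made up for illustration, and the check assumes every token on a line should carry the same number of features.

```python
# Sketch only: a Python stand-in for the sed one-liner, plus a sanity check.
FEATURE_SEP = "\uFFE8"  # U+FFE8, written as an escape to avoid glyph mix-ups


def convert_line(line):
    """Replace every ASCII pipe with U+FFE8 (same blanket behavior as the sed)."""
    return line.replace("|", FEATURE_SEP)


def feature_counts(line):
    """Return the set of per-token feature counts found on one line."""
    return {token.count(FEATURE_SEP) for token in line.split()}


sample = "app|app|NN startup|startup|NN times|time|NNS"
converted = convert_line(sample)

# Assumption: a well-formed line yields exactly one distinct count.
print(feature_counts(converted))
```

Like the sed command, this rewrites every pipe, including one that appears as a literal token in the text, so a line reporting more than one distinct count is worth inspecting by hand.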

I’m now getting an error during preprocessing with this traceback:

[03/06/17 02:34:14 INFO] Building train vocabularies...	
[03/06/17 02:35:28 INFO] Created word dictionary of size 113548 (pruned from 166192)	
/home/dblandan/gitstuff/torch/install/bin/luajit: ./onmt/data/Vocabulary.lua:142: attempt to index a nil value
stack traceback:
	./onmt/data/Vocabulary.lua:142: in function 'init'
	preprocess.lua:48: in function 'main'
	preprocess.lua:115: in main chunk
	[C]: in function 'dofile'
	...tuff/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00405d50

Have I missed something obvious again?

There was a small bug. Could you retry with the newest version?


Perfect, thanks! :+1: