Nothing in vocab.pt after preprocessing

pytorch

(Wen Tsai) #1

Hi,

I remember preprocessing the same data without problems before, but this time something strange happened.

I have both *.train.pt and *.valid.pt; however, *.vocab.pt is effectively empty. The sizes of train.pt and valid.pt look normal, but I cannot open them to make sure. I opened vocab.pt and found only the 4 default tokens inside.

This hasn’t happened to me before, and I cannot figure out the reason. I’m processing Chinese data, and preprocessing works normally when the data has been segmented. It fails when I split the data into single characters.

Does anyone have an idea that could help me fix or debug this?


(Wen Tsai) #2

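I solved it by setting src_seq_length and tgt_seq_length. It seems the character-split sentences were longer than the default length limit, so they were filtered out during preprocessing.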

More information: https://github.com/OpenNMT/OpenNMT/blob/master/docs/data/preparation.md#sentence-length
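
For anyone who hits the same symptom, here is a rough sketch of why a length limit can leave the vocabulary with only the default tokens. It is not the OpenNMT code: build_vocab, the SPECIALS names, and the limit of 50 are assumptions made up for illustration, and it assumes the vocabulary is counted only from sentences that pass the length filter.

```python
from collections import Counter

# Hypothetical stand-ins for the 4 default tokens; the real names may differ.
SPECIALS = ["<unk>", "<blank>", "<s>", "</s>"]


def build_vocab(sentences, max_len):
    """Count tokens only from sentences with at most max_len tokens."""
    counts = Counter()
    kept = 0
    for sentence in sentences:
        tokens = sentence.split()
        if 0 < len(tokens) <= max_len:
            counts.update(tokens)
            kept += 1
    vocab = SPECIALS + [tok for tok, _ in counts.most_common()]
    return vocab, kept


# Synthetic data: the same "sentence" as 30 word tokens or 80 character tokens.
word_line = " ".join("w%d" % i for i in range(30))
char_line = " ".join("c%d" % i for i in range(80))

# Assumed limit of 50: the word-level line passes, the char-level line is dropped.
word_vocab, word_kept = build_vocab([word_line], max_len=50)
char_vocab, char_kept = build_vocab([char_line], max_len=50)

# Raising the limit keeps the char-level line, so its tokens reach the vocab.
fixed_vocab, fixed_kept = build_vocab([char_line], max_len=100)

print(word_kept, len(word_vocab))
print(char_kept, len(char_vocab))
print(fixed_kept, len(fixed_vocab))
```

If every sentence is over the limit, nothing is counted and only the special tokens remain, which matches what I saw in vocab.pt. Raising the limit so it covers the longest character-split sentence avoids this.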