Nothing in vocab.pt after preprocessing

WenTsai · April 22, 2018, 2:26pm

Hi,

I remember preprocessing the same data without problems before, but this time something strange happened.

I have both *.train.pt and *.valid.pt; however, *vocab.pt is effectively empty. The sizes of train.pt and valid.pt look normal, but I cannot open them to make sure. I opened vocab.pt and found only the 4 default tokens inside.

This has not happened to me before, and I cannot find the reason. I'm processing Chinese data, and preprocessing works normally when the data has been segmented. It fails when I split the data into single characters.

Does anyone have an idea that could help me fix or debug it?

WenTsai · April 25, 2018, 5:15am

I solved it by setting src_seq_length and tgt_seq_length to larger values when preprocessing.

More information: https://github.com/OpenNMT/OpenNMT/blob/master/docs/data/preparation.md#sentence-length
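
For anyone hitting the same thing, here is a rough sketch of what presumably went wrong: splitting into single characters makes every sentence much longer in tokens, so a length filter can drop pairs that passed when the data was word-segmented. This is not OpenNMT code. `count_kept` is a hypothetical helper, and the limit of 50 is an assumed default, so check the linked page for the real value in your version.

```python
def count_kept(src_lines, tgt_lines, src_seq_length=50, tgt_seq_length=50):
    """Count sentence pairs that survive a simple token-length filter.

    Hypothetical helper for illustration only; the default of 50 is an
    assumption, not a value taken from the OpenNMT source.
    """
    kept = 0
    for src, tgt in zip(src_lines, tgt_lines):
        n_src = len(src.split())  # tokens are whitespace-separated
        n_tgt = len(tgt.split())
        if 0 < n_src <= src_seq_length and 0 < n_tgt <= tgt_seq_length:
            kept += 1
    return kept


# The same 60-character sentence, tokenized two ways.
segmented = " ".join(["中文"] * 30)   # 30 two-character tokens
char_split = " ".join("中文" * 30)    # 60 single-character tokens
target = "a b c"

print(count_kept([segmented], [target]))                       # pair is kept
print(count_kept([char_split], [target]))                      # pair is dropped
print(count_kept([char_split], [target], src_seq_length=100))  # kept again
```

Running something like this on your own files before preprocessing shows how many pairs a given limit would keep. If the count is zero or very small, raise src_seq_length and tgt_seq_length.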