Hi, I have 130k training src-tgt pairs, but the training log shows only 22k training pairs after preprocessing the data

pytorch

(Guangxiang Zhao) #1
Loading train dataset from data/***.train.1.pt, number of examples: 22307

The training src and tgt sentences are written to src-train.txt and tgt-train.txt, respectively.


(jean.senellart) #2

Hello @zhaoguangxiang, what is your process/command for “preprocessing the data”?


(Guangxiang Zhao) #3

python3 preprocess.py -train_src data/convai2_new/src-train.txt -train_tgt data/convai2_new/tgt-train.txt -valid_src data/convai2_new/src-val.txt -valid_tgt data/convai2_new/tgt-val.txt -save_data data/convai2_new


(jean.senellart) #4

Did you play with the src_seq_length and tgt_seq_length options? These options filter out sentences that are too long. Just try values larger than the default (50).
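
To check whether the length filter explains the drop from 130k to 22k, you can count how many pairs fit within the limits. This is only a rough sketch, not the preprocessing code itself: the helper names are hypothetical, it assumes whitespace tokenization, and it assumes a pair is kept when both sides have between 1 and the maximum number of tokens. The exact rule used by preprocess.py may differ.

```python
def count_kept_pairs(src_lines, tgt_lines, max_src_len=50, max_tgt_len=50):
    """Return (kept, total) for parallel lines under an assumed length filter.

    Assumption: a pair is kept when both sides are non-empty and no longer
    than the given limits, counting whitespace-separated tokens.
    """
    kept = 0
    total = 0
    for src, tgt in zip(src_lines, tgt_lines):
        total += 1
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        if 0 < n_src <= max_src_len and 0 < n_tgt <= max_tgt_len:
            kept += 1
    return kept, total


def count_kept_pairs_in_files(src_path, tgt_path, **limits):
    """Same count, reading the two parallel text files line by line."""
    with open(src_path, encoding="utf-8") as src_file, \
            open(tgt_path, encoding="utf-8") as tgt_file:
        return count_kept_pairs(src_file, tgt_file, **limits)
```

For example, calling count_kept_pairs_in_files("data/convai2_new/src-train.txt", "data/convai2_new/tgt-train.txt") with the default limits and getting a count close to 22307 would suggest that the length filter is the cause. Calling it again with larger max_src_len and max_tgt_len values shows how many pairs a higher limit would retain.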