Hi, I have 130k training src/tgt pairs, but the training log shows only about 22k training pairs after preprocessing:

Loading train dataset from data/***.train.1.pt, number of examples: 22307

The training src and tgt sentences are written to src-train.txt and tgt-train.txt, respectively.

Hello @zhaoguangxiang, what is your process/command for “preprocessing the data”?


python3 preprocess.py -train_src data/convai2_new/src-train.txt -train_tgt data/convai2_new/tgt-train.txt -valid_src data/convai2_new/src-val.txt -valid_tgt data/convai2_new/tgt-val.txt -save_data data/convai2_new

Did you change the src_seq_length and tgt_seq_length options? They filter out sentences that are too long. Try values larger than the default (50).
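
To check whether the length filter explains the drop from 130k to 22k, you can count how many pairs would survive a given limit before rerunning preprocessing. This is only a rough sketch: `count_kept_pairs` and `count_kept_pairs_in_files` are hypothetical helper names, and it assumes whitespace tokenization and that a pair is kept only when both sides are non-empty and no longer than the limit. The exact rule inside preprocess.py may differ slightly.

```python
def count_kept_pairs(src_lines, tgt_lines, src_seq_length=50, tgt_seq_length=50):
    """Count pairs whose src and tgt token counts fit within the limits.

    Assumption: tokens are whitespace-separated and the limits are inclusive.
    """
    kept = 0
    for src, tgt in zip(src_lines, tgt_lines):
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        # Keep the pair only if both sides are non-empty and short enough.
        if 0 < n_src <= src_seq_length and 0 < n_tgt <= tgt_seq_length:
            kept += 1
    return kept


def count_kept_pairs_in_files(src_path, tgt_path, src_seq_length=50, tgt_seq_length=50):
    """Same count, reading one sentence per line from two parallel files."""
    with open(src_path, encoding="utf-8") as src_file, \
         open(tgt_path, encoding="utf-8") as tgt_file:
        return count_kept_pairs(src_file, tgt_file, src_seq_length, tgt_seq_length)
```

For example, calling `count_kept_pairs_in_files("data/convai2_new/src-train.txt", "data/convai2_new/tgt-train.txt")` with the default limits and then with larger ones shows how many pairs each setting would keep. If the count at 50 is close to 22k, passing larger src_seq_length and tgt_seq_length values to preprocess.py should bring back the missing pairs.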


Thank you, you are right.