Validation dataset

LinyuZhang · May 25, 2020, 5:25pm

Hello everyone. I ran into a small problem while training a zh-en translation model.
I am using a dataset with a 1-million-line training corpus, a 10-thousand-line validation set, and a 10-thousand-line test set. The Chinese side of the parallel corpus is tokenized with Jieba and Moses. The English side is tokenized with Moses.

I preprocessed the data with src_seq_length and tgt_seq_length both set to 80, and I am training with all default parameters.

Training itself runs normally. However, the validation step always fails with “CUDA: out of memory”. Can anyone give me some advice on how to address this problem? Thanks so much!

francoishernandez · May 27, 2020, 3:18pm

Hey there,
You might want to reduce the validation batch size with the -valid_batch_size flag (default is 32).
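
To illustrate why a smaller validation batch can help, here is a rough sketch in Python. It assumes each batch is padded to its longest sentence, so its size in tokens is the number of sentences times the longest length in the batch. The function name largest_padded_batch and the sentence lengths are hypothetical, made up for illustration, and the token count is only a proxy for GPU memory, not a measurement. One plausible cause of the error (an assumption, not confirmed in this thread) is that the validation set contains a few sentences much longer than those seen in training.

```python
def largest_padded_batch(lengths, batch_size):
    """Return the token count of the largest padded batch.

    Each batch is padded to its longest sentence, so its size in tokens
    is (number of sentences) * (longest sentence in the batch).
    """
    largest = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        largest = max(largest, len(batch) * max(batch))
    return largest


# Hypothetical sentence lengths (in tokens) for a validation set that
# contains one unusually long sentence.
valid_lengths = [20, 35, 18, 400, 27, 31, 22, 40]

for size in (8, 4, 2):
    print(size, largest_padded_batch(valid_lengths, size))
# 8 3200
# 4 1600
# 2 800
```

In this toy case, halving the batch size halves the largest padded batch. The same reasoning is why passing a smaller value to -valid_batch_size (for example 8 instead of the default 32) may keep validation within GPU memory.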

LinyuZhang · May 28, 2020, 11:28am

Thanks so much for your suggestion. I have done it successfully.