OpenNMT-py Training takes 4 hours - regardless of corpus size

Hello,
It doesn’t seem to matter whether my corpus has 10K, 100K, 1 million, or 20 million segments; training takes about 4 hours on my GPU. I also see that there are 100K steps, again regardless of corpus size. This does not seem right. Before I dive into the scripts and hyperparameters, I wanted to check whether this is the default expected behavior. Here are the commands I use to preprocess, train, translate, and evaluate the model:

python3 preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
python3 train.py -data data/demo -save_model demo-model -world_size 1 -gpu_ranks 0
python3 translate.py -model enes_1mil/demo-model_step_100000.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
~/workspace/OpenNMT-py/tools/multi-bleu.perl data/tgt-test.txt < data/pred.txt

I get 35.68 BLEU (detokenized) on my EN>ES model trained on 20 million segments from the ParaCrawl corpus.

Any feedback is appreciated.

Thanks,
Steve

Hi,

Yes, this is expected: training runs for a fixed number of steps (100,000 by default) regardless of corpus size. You should configure the -train_steps training option.
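
For example, something along these lines (the step count here is purely illustrative; the right value depends on your corpus size and batch size):

python3 train.py -data data/demo -save_model demo-model -world_size 1 -gpu_ranks 0 -train_steps 200000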

Hi Guillaume,
Thanks for the feedback. Excuse my ignorance here, but is there a correlation between corpus size and the number of training steps? What would you use when training with 20 million segments? 1 million? 100K?

Edit: Better yet, if you can point me to some documentation on RNNs/LSTMs that would help me understand this parameter better, that would be great!
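
For reference, here is my rough back-of-the-envelope math for how many passes over the corpus 100K steps would cover, assuming the default batch size of 64 sentences per step (that default is an assumption on my part, please correct me if it is off):

# How many epochs do 100K steps cover for each corpus size?
train_steps = 100000
batch_size = 64  # sentences per step (assumed default)
for corpus_size in (10000, 100000, 1000000, 20000000):
    epochs = train_steps * batch_size / corpus_size
    print("%d segments -> ~%.1f epochs" % (corpus_size, epochs))

If that math is right, a 10K-segment corpus gets hundreds of epochs while a 20-million-segment corpus sees less than one full pass, which would explain the constant training time.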

Thanks,
Steve