OpenNMT-py Training takes 4 hours - regardless of corpus size

It doesn’t seem to matter whether my corpus has 10K, 100K, 1 million, or 20 million segments; training takes 4 hours on my GPU. I see that there are 100K steps, again regardless of corpus size. This does not seem right. Before I dive into the scripts and hyperparameters, I wanted to check whether this is the default expected behavior. Here are the commands I use to preprocess, train, translate, and evaluate the model:

python3 preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
python3 train.py -data data/demo -save_model demo-model -world_size 1 -gpu_ranks 0
python3 translate.py -model enes_1mil/ -src data/src-test.txt -output pred.txt -replace_unk -verbose
~/workspace/OpenNMT-py/tools/multi-bleu.perl data/tgt-test.txt < data/pred.txt

I get 35.68 BLEU (detok) on my EN>ES model trained on 20 million segments from the ParaCrawl corpus.

Any feedback is appreciated.



Yes, this is expected. You should configure the training option -train_steps.
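For intuition on why runtime is decoupled from corpus size: -train_steps counts optimizer updates, not passes over the data, so a fixed 100K steps corresponds to very different numbers of epochs depending on how many segments you have. A minimal sketch of the arithmetic, assuming sentence-level batching at a batch size of 64 (an assumption; token-level batching changes the numbers):

```python
# Rough sketch (not part of OpenNMT-py): relate -train_steps to epochs.
# Assumes sentence-level batching with a fixed batch size of 64 sentences;
# token-level batching would change this arithmetic.

def train_steps_for_epochs(corpus_size, batch_size, epochs):
    """Optimizer steps needed to see the corpus `epochs` times."""
    steps_per_epoch = -(-corpus_size // batch_size)  # ceiling division
    return steps_per_epoch * epochs

# How many epochs do a fixed 100,000 steps amount to?
for n in (10_000, 1_000_000, 20_000_000):
    epochs = 100_000 / train_steps_for_epochs(n, 64, 1)
    print(f"{n:>10} segments -> ~{epochs:.1f} epochs in 100K steps")
```

Under these assumptions, 100K steps is well under one full pass over 20 million segments, while a 10K-segment corpus is seen hundreds of times over; runtime stays flat, but the effective training regime is wildly different.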

Hi Guillaume,
Thanks for the feedback. Excuse my ignorance here: say you train with 20 million segments, or 1 million, or 100K. Should the number of training steps differ in each case?

Or is there no correlation between corpus size and training steps?

Edit: Better yet, if you can point me to some documentation somewhere on RNNs/LSTMs where I can better understand this parameter, that would be great!