OpenNMT Forum

Incomplete loading of datasets during training

I tried to train on a German -> English dataset of 40 million examples with onmt_train on the command line. The system created 40 files, but during the training steps only 10 .pt files were loaded. Why were not all 40 training files loaded during training? Is there a maximum? I am new to OpenNMT. Can anybody help me, please?

You probably didn’t give it enough training steps for it to loop over all the data.

I used the default of 100,000 training steps, but during this process only 10 of the 40 .pt files were loaded.

Did you keep the training logs? If so, could you share these?
Also, can you share the command line you executed?

Thank you, Francois, for your reply. Unfortunately I shut down the computer and lost the training logs (stupid of me), but I can share the command line: onmt_train -data data/demo40m -save_model demo40m-model.
Strangely enough, during training the first dataset was loaded, then the next, and after that the next, and so on. After the 100,000 steps the training completed and the Miniconda prompt reappeared. Hopefully this info will help you.

The default batch size is 64 examples. With 100k steps, the model will have seen 6.4 million examples, so only 16% of your dataset.
Nothing abnormal here.
The shards are read in that order because the loader relies on glob rather than on sorting by the integer IDs themselves.
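To illustrate the ordering point, here is a minimal sketch in Python. The shard names are hypothetical (the actual filenames are not shown in this thread), and it assumes the globbed paths end up in plain string order. It only shows how string order differs from integer-id order; it is not the loader's actual code.

```python
import re

# Hypothetical shard names, for illustration only.
shards = [f"demo40m.train.{i}.pt" for i in range(12)]


def shard_id(name):
    """Return the integer id that sits just before the '.pt' suffix."""
    match = re.search(r"\.(\d+)\.pt$", name)
    return int(match.group(1))


# Plain string order: "10" and "11" sort before "2".
lexicographic = sorted(shards)

# Integer-id order: 0, 1, 2, ..., 11.
numeric = sorted(shards, key=shard_id)

print([shard_id(n) for n in lexicographic])
print([shard_id(n) for n in numeric])
```

In string order, shard 10 comes right after shard 1, which can look as if shards are being skipped even though they are simply read in a different sequence.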

Thanks again for your reply. What should I do to get the whole dataset used?

As mentioned above, with 100k steps you are only training on 16% of your dataset. To train on 100% of it at the default batch size of 64, increase the number of steps to 625K.
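The arithmetic behind these numbers can be checked with a short Python sketch. It assumes that each training step consumes exactly one batch and that batch size is counted in examples (sentence pairs). The function names are made up for this illustration.

```python
import math


def examples_seen(train_steps, batch_size):
    """Number of examples consumed after train_steps steps."""
    return train_steps * batch_size


def steps_for_one_pass(dataset_size, batch_size):
    """Steps needed to see every example once (rounded up)."""
    return math.ceil(dataset_size / batch_size)


dataset_size = 40_000_000  # examples, as stated in the thread
batch_size = 64            # default batch size mentioned above

seen = examples_seen(100_000, batch_size)
fraction = seen / dataset_size
steps = steps_for_one_pass(dataset_size, batch_size)

print(seen)      # 6400000
print(fraction)  # 0.16
print(steps)     # 625000
```

Under the same assumptions, steps_for_one_pass(40_000_000, 400) gives 100,000, so a batch size of 400 would cover the dataset once in the default number of steps, provided a batch that large fits in memory.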

Thanks Arbin. Could I also increase the batch size to 400?

I don’t think increasing the batch size to 400 is an alternative, as such a huge batch might not fit in your system’s memory.

Thanks again, Arbin. I have a PC with 32 gigabytes of RAM and a 1-terabyte SSD. Would this fit, in your opinion?

You can experiment with the batch size: keep increasing it until you get an error, then settle for something smaller that works.
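One way to organize that experiment is sketched below in Python. The callable `fits` is hypothetical: in practice it would stand for "start a short training run with this batch size and report whether it ran without a memory error". The sketch also assumes that if one size fails, every larger size fails too.

```python
def largest_working_batch_size(fits, start=64, limit=4096):
    """Return the largest size in [start, limit] for which fits(size) is True.

    fits is a hypothetical callable: True if a trial run at that batch
    size succeeds, False if it fails (for example, out of memory).
    Returns None if even the starting size fails.
    """
    if not fits(start):
        return None

    # Doubling phase: grow until a size fails or the limit is reached.
    low = start
    while low < limit:
        candidate = min(low * 2, limit)
        if not fits(candidate):
            high = candidate
            break
        low = candidate
    else:
        return low  # every size up to the limit worked

    # Bisection phase: low works, high fails; narrow the gap.
    while high - low > 1:
        mid = (low + high) // 2
        if fits(mid):
            low = mid
        else:
            high = mid
    return low


# Stand-in for real trial runs: pretend anything above 300 fails.
print(largest_working_batch_size(lambda size: size <= 300))  # 300
```

It is usually sensible to settle somewhat below the value found, since memory use can vary from batch to batch.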

Thank you very much. I will experiment with increasing both batch_size and train_steps.
Your answer has helped me a lot.