"Train from" - choosing preprocess chunk

icanfast · April 24, 2020, 11:01am

Hello everyone!
I have a parallel corpus of 29M sentences, which was naturally divided into 29 chunks: preprocessed.train.0.pt, preprocessed.train.1.pt, etc.
I train my model on Colab, which crashes every 6-12 hours, and when I warm start with the command:
!python OpenNMT-py/train.py -data "/content/gdrive/My Drive/onmt/data/ultimate_en_corpus/preprocessed" -save_model "/content/gdrive/My Drive/onmt/data/model/model_u_2" -train_from "/content/gdrive/My Drive/onmt/data/model/model_u_step_13000.pt" -gpu_rank 0
training starts over from chunk 0. As a result, my model only ever sees the first 5-8 chunks before each crash.
Can I somehow choose which chunk to start training from, or do I have to create a new sub-corpus?
Thank you!

francoishernandez · April 24, 2020, 11:28am

Hey @icanfast
This is an interesting idea and we’ve been thinking of adding it, but we haven’t gotten around to implementing it yet.
As a workaround, you can rename your shards, or create links to them, to change their numbering.
Or, if you’d be willing to make this possible via a new flag, we’d be happy to accept a PR.
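
A minimal sketch of the linking workaround, not an OpenNMT-py feature: it creates symlinks under a new prefix so that a chosen shard becomes shard 0 and the others follow in rotated order. The function name `rotate_shards` and the destination prefix are hypothetical. It assumes shards are named `<prefix>.train.N.pt`, that the other `<prefix>.*` files (such as vocab and validation data) just need to be linked unchanged, and that shards are consumed in numeric index order, so check how your version orders the shard files before relying on it.

```python
import os
import re
from pathlib import Path


def _link(target, link):
    """Create (or replace) a symlink at `link` pointing to `target`."""
    if link.is_symlink() or link.exists():
        link.unlink()
    os.symlink(target.resolve(), link)


def rotate_shards(src_prefix, dst_prefix, start):
    """Symlink shards so that source shard `start` becomes shard 0.

    Hypothetical helper: `src_prefix` is the value passed to -data,
    `dst_prefix` is a new prefix to pass to -data after linking.
    Returns a dict mapping new shard index -> original shard index.
    """
    src_prefix, dst_prefix = Path(src_prefix), Path(dst_prefix)
    dst_prefix.parent.mkdir(parents=True, exist_ok=True)

    # Collect training shards, keyed by their numeric index.
    pattern = re.compile(re.escape(src_prefix.name) + r"\.train\.(\d+)\.pt$")
    shards, others = {}, []
    for path in src_prefix.parent.iterdir():
        match = pattern.match(path.name)
        if match:
            shards[int(match.group(1))] = path
        elif path.name.startswith(src_prefix.name + "."):
            others.append(path)  # e.g. vocab and validation files

    indices = sorted(shards)
    if not 0 <= start < len(indices):
        raise ValueError(f"start must be in [0, {len(indices)})")

    # Rotate: new shard 0 is old shard `start`, wrapping around at the end.
    mapping = {}
    for new_idx in range(len(indices)):
        old_idx = indices[(start + new_idx) % len(indices)]
        link = dst_prefix.parent / f"{dst_prefix.name}.train.{new_idx}.pt"
        _link(shards[old_idx], link)
        mapping[new_idx] = old_idx

    # Link the remaining files under the new prefix without renumbering.
    for path in others:
        suffix = path.name[len(src_prefix.name):]
        _link(path, dst_prefix.parent / (dst_prefix.name + suffix))
    return mapping
```

For example, calling `rotate_shards(".../ultimate_en_corpus/preprocessed", "/content/links/preprocessed", start=8)` and then passing the new prefix to `-data` would make training begin at the original shard 8 (the `/content/links` directory is only an illustration). If the target filesystem does not support symlinks, renaming or copying the files achieves the same effect.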

icanfast · April 24, 2020, 11:45am

Thank you!
Yes, this workaround came to mind too. I just hadn’t looked into the preprocessed files and was worried they might contain some metadata, but now I’ll just go and do that!

francoishernandez · July 20, 2020, 5:12pm

Hey @icanfast
I opened a PR that aims to tackle this: https://github.com/OpenNMT/OpenNMT-py/pull/1826
It would be great if you could check out this branch and test it in your setup.