I have a parallel corpus of 29M sentences, which preprocessing split into 29 chunks: preprocessed.train.0.pt, preprocessed.train.1.pt, etc.
I train my model on Colab, which crashes every 6-12 hours, and when I warm-start with the command:
!python OpenNMT-py/train.py -data "/content/gdrive/My Drive/onmt/data/ultimate_en_corpus/preprocessed" -save_model "/content/gdrive/My Drive/onmt/data/model/model_u_2" -train_from "/content/gdrive/My Drive/onmt/data/model/model_u_step_13000.pt" -gpu_rank 0
it starts training from chunk 0 all over again, so the model only ever sees the first 5-8 chunks.
Can I somehow choose which chunk to start training from, or do I have to create a new sub-corpus?
This is an interesting idea and we’ve been thinking of adding it, but we haven’t gotten around to implementing it yet.
As a workaround, you can rename your shards, or create some symlinks, to change their numbering so training resumes at the shard you want.
Or, if you’d be willing to make this possible via a new flag, we’d be happy to accept a PR.
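The renaming idea can be sketched as a small shell loop. This is just an illustration, not an official OpenNMT-py feature: the directory names, the `START` index, and the rotated-directory layout are assumptions you would adapt to your setup (note that symlinks may not survive on Google Drive mounts, in which case `cp` or `mv` the files instead):

```shell
# Sketch: rotate shard numbering so training resumes at chunk START
# instead of chunk 0. Paths and START are hypothetical placeholders.
SRC="data/ultimate_en_corpus"          # directory holding the original shards
DST="data/ultimate_en_corpus_rotated"  # point train.py's -data prefix here
START=13                               # first chunk the model has NOT seen yet
TOTAL=29                               # total number of train shards

mkdir -p "$DST"
for i in $(seq 0 $((TOTAL - 1))); do
  # map chunk 13 -> 0, 14 -> 1, ..., 12 -> 28
  new=$(( (i - START + TOTAL) % TOTAL ))
  ln -sf "../ultimate_en_corpus/preprocessed.train.$i.pt" \
         "$DST/preprocessed.train.$new.pt"
done
# the vocab file must be reachable under the same prefix as well
ln -sf "../ultimate_en_corpus/preprocessed.vocab.pt" "$DST/preprocessed.vocab.pt"
```

You would then pass `-data ".../ultimate_en_corpus_rotated/preprocessed"` to train.py together with your usual `-train_from` checkpoint.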
Yes, this workaround came to mind too; I just hadn’t looked inside the preprocessed files and wasn’t sure whether they carry some metadata tied to their numbering, but now I’ll just go and do that!