How to resume training from the last interrupted state

Ravneet · July 18, 2019, 11:24am

I was training a translation model, but a power failure caused the system to restart. The last checkpoint I saved is at training step 90K. Is there any way to resume training from that state?

park · July 18, 2019, 11:50am

If you are training with OpenNMT-py, use this option:

--train_from, -train_from

If training from a checkpoint then this is the path to the pretrained model’s state_dict.

Default: “”
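The checkpoint in this thread is named sum_eng-model_90000.pt, which matches the -save_model prefix followed by the step number. A minimal sketch, assuming that `<prefix>_<step>.pt` naming, of picking the highest-numbered checkpoint to pass to -train_from. The helper name `latest_checkpoint` is hypothetical and is not part of OpenNMT-py:

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(directory: str, prefix: str) -> Optional[Path]:
    """Return the '<prefix>_<step>.pt' file with the highest step, or None.

    Assumes checkpoints are named like 'sum_eng-model_90000.pt'.
    """
    folder = Path(directory)
    if not folder.is_dir():
        return None

    # Match the whole file name: prefix, underscore, digits, '.pt'.
    pattern = re.compile(re.escape(prefix) + r"_(\d+)\.pt")
    best_step = -1
    best_path = None
    for path in folder.iterdir():
        match = pattern.fullmatch(path.name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = path
    return best_path
```

For example, `latest_checkpoint("data", "sum_eng-model")` returns the path of the newest matching file in data/, or None if there is none.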

Ravneet · July 18, 2019, 12:16pm

Dear @park, thanks for your reply. However, I am getting the error “[Errno 2] No such file or directory: ‘sum_eng-model_90000.pt’”.
My saved model is in the main directory, alongside files such as preprocessing.py and train.py (the default location where OpenNMT saves checkpoints).
I just added -train_from sum_eng-model_90000.pt to my command.
What should I change to make this work?
Many thanks in advance.

park · July 18, 2019, 12:33pm

You need to specify the path to your model:

-train_from YOUR/MODEL/PATH/sum_eng-model_90000.pt

Please add the PATH
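A relative path is resolved against the directory the command is run from, so it can help to confirm what the path actually points at before passing it to -train_from. A minimal sketch using only the Python standard library; the function name `resolve_checkpoint` is hypothetical:

```python
import os


def resolve_checkpoint(path: str) -> str:
    """Return the absolute path of an existing checkpoint file.

    Raises FileNotFoundError with the resolved path and the current
    working directory, which makes a wrong relative path easy to spot.
    """
    absolute = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(absolute):
        raise FileNotFoundError(
            f"{path!r} resolved to {absolute!r}, which is not a file "
            f"(current directory: {os.getcwd()!r})"
        )
    return absolute
```

Calling `resolve_checkpoint("data/sum_eng-model_90000.pt")` from the directory that contains train.py either returns an absolute path that can be given to -train_from, or reports where Python actually looked.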

Ravneet · July 19, 2019, 8:00pm

@park Sorry, but I still cannot resolve it. I moved sum_eng-model_90000.pt into the data folder, alongside the train, test, and valid data files, and added -train_from data/sum_eng-model_90000.pt to my command, but I am still getting a file-not-found error. Can you please help?

The file is named sum_eng-model_90000.pt and it is in the data folder.
What should I add to my command?

park · July 20, 2019, 1:08am

Please give me the full command

Ravneet · July 20, 2019, 4:22am

@park Here is the full command I am using

CUDA_VISIBLE_DEVICES=0 python train.py -src_word_vec_size 200 -tgt_word_vec_size 200 -data data/model -save_model sum_eng-model -save_checkpoint_steps 100 -world_size 1 -gpu_ranks 0 -batch_size 64 -valid_steps 10000 -train_steps 100000 -report_every 50 -train_from data/sum_eng-model_90000.pt

park · July 20, 2019, 9:03am

Try

CUDA_VISIBLE_DEVICES=0 python3 train.py -src_word_vec_size 200 -tgt_word_vec_size 200 -data data/model -save_model sum_eng-model -save_checkpoint_steps 100 -world_size 1 -gpu_ranks 0 -batch_size 64 -valid_steps 10000 -train_steps 100000 -report_every 50 --train_from data/sum_eng-model_90000.pt

Ravneet · July 20, 2019, 9:09am

@park Still the same error: [Errno 2] No such file or directory: ‘data/sum_eng-model_90000.pt’.
The file is there and the file name is correct too.

park · July 20, 2019, 10:02am

Try rebuilding and reinstalling:

python3 setup.py build
python3 setup.py install
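After reinstalling, one general check is to see which copy of a package Python actually imports, since an older installed copy can shadow the local checkout. The thread does not say which dependency was at fault, so this is only a diagnostic sketch. The function name `package_location` is hypothetical, and the package name `onmt` in the usage note is an assumption about how OpenNMT-py is installed:

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file Python would load for a top-level package.

    Returns None if the package cannot be found on the import path.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # spec.origin is the path of the module or the package's __init__.py.
    return spec.origin
```

For instance, `print(package_location("onmt"))` shows whether the import comes from the current source tree or from site-packages, and prints None if the package is not installed at all.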

Ravneet · July 20, 2019, 8:01pm

@park Thanks for your effort. This was a dependency issue.

park · July 21, 2019, 3:30am

Okay, nice work.