Resume from training

chopinml · April 14, 2021, 9:35pm

Hello again,

Sadly, I don’t have a decent GPU, so I’m using Google Colab. My account is usually assigned a Tesla T4, which is nice.

The GPU is quite powerful, and I’m following the WMT14 translation script from here:
https://opennmt.net/OpenNMT-py/examples/Translation.html

I started training; here is the console output:

[2021-04-14 20:53:44,478 INFO] Step 100/100000; acc: 3.57; ppl: 6607.30; xent: 8.80; lr: 0.00001; 5557/5715 tok/s; 158 sec
[2021-04-14 20:56:27,445 INFO] Step 200/100000; acc: 4.72; ppl: 3428.49; xent: 8.14; lr: 0.00002; 5200/5498 tok/s; 321 sec
[2021-04-14 20:59:10,803 INFO] Step 300/100000; acc: 7.41; ppl: 1355.33; xent: 7.21; lr: 0.00004; 5297/5468 tok/s; 484 sec
[2021-04-14 21:01:54,482 INFO] Step 400/100000; acc: 9.19; ppl: 703.58; xent: 6.56; lr: 0.00005; 5304/5522 tok/s; 648 sec
[2021-04-14 21:04:40,492 INFO] Step 500/100000; acc: 10.63; ppl: 511.33; xent: 6.24; lr: 0.00006; 5392/5460 tok/s; 814 sec
[2021-04-14 21:07:23,062 INFO] Step 600/100000; acc: 11.35; ppl: 401.52; xent: 6.00; lr: 0.00007; 5215/5471 tok/s; 977 sec
[2021-04-14 21:10:07,182 INFO] Step 700/100000; acc: 12.96; ppl: 324.71; xent: 5.78; lr: 0.00009; 5315/5452 tok/s; 1141 sec
[2021-04-14 21:12:52,057 INFO] Step 800/100000; acc: 13.88; ppl: 272.98; xent: 5.61; lr: 0.00010; 5323/5495 tok/s; 1306 sec
[2021-04-14 21:15:37,431 INFO] Step 900/100000; acc: 14.66; ppl: 238.75; xent: 5.48; lr: 0.00011; 5232/5496 tok/s; 1471 sec
[2021-04-14 21:18:20,960 INFO] Step 1000/100000; acc: 15.99; ppl: 203.52; xent: 5.32; lr: 0.00012; 5246/5467 tok/s; 1634 sec
[2021-04-14 21:21:01,671 INFO] Step 1100/100000; acc: 17.06; ppl: 179.93; xent: 5.19; lr: 0.00014; 5367/5566 tok/s; 1795 sec
[2021-04-14 21:23:44,878 INFO] Step 1200/100000; acc: 17.93; ppl: 160.77; xent: 5.08; lr: 0.00015; 5234/5434 tok/s; 1958 sec

The problem is that, by a rough calculation, 100,000 steps will take about 2 days (~160 seconds per 100 steps, so 160,000 seconds ≈ 44.4 hours).
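For reference, the rough estimate above can be reproduced with a few lines of Python. This is only a sketch: the function name `estimate_hours` is made up for illustration, and the 160 seconds per 100 steps figure is the approximate interval read off the log above.

```python
def estimate_hours(total_steps, seconds_per_interval, interval_steps=100):
    """Estimate wall-clock hours for a run from the time taken per logging interval."""
    # Number of logging intervals times the seconds each one takes.
    total_seconds = total_steps / interval_steps * seconds_per_interval
    return total_seconds / 3600


# 100,000 steps at ~160 seconds per 100 steps (figures from the log above).
hours = estimate_hours(100_000, 160)
print(f"{hours:.1f} hours")  # prints "44.4 hours"
```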

Is there any option to continue from a saved checkpoint? For example, if I have trained 15k steps today, can I carry on from the last trained point? I have seen the train_from parameter and searched the docs and forum, but sadly couldn’t find a recent example.

Thanks in advance

francoishernandez · April 15, 2021, 7:44am

train_from lets you give a checkpoint to start from instead of randomly initializing the parameters. It’s not ideal, though, since iteration over the datasets will start from scratch. For WMT14 data that might not be much of an issue, as the dataset is not very big.
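As a rough illustration of how train_from might be used between Colab sessions, the sketch below picks the most recent checkpoint in a directory and builds an `onmt_train` command that resumes from it. Several things here are assumptions rather than facts from this thread: that checkpoints are named `<save_model>_step_<N>.pt`, that the config file is called `wmt14_en_de.yaml`, and the `run` directory and `model` prefix. The helper names `latest_checkpoint` and `resume_command` are hypothetical. As noted above, this restores the model from the checkpoint but does not resume dataset iteration where it stopped.

```python
import re
from pathlib import Path


def latest_checkpoint(directory, prefix):
    """Return the checkpoint path with the highest step number, or None if there is none.

    Assumes files are named '<prefix>_step_<N>.pt' (an assumption, see above).
    """
    pattern = re.compile(re.escape(prefix) + r"_step_(\d+)\.pt$")
    best = None
    for path in Path(directory).glob(f"{prefix}_step_*.pt"):
        match = pattern.match(path.name)
        if match:
            step = int(match.group(1))
            # Compare steps numerically: a plain string sort would rank 5000 above 15000.
            if best is None or step > best[0]:
                best = (step, path)
    return best[1] if best else None


def resume_command(config, checkpoint):
    """Build the onmt_train argument list, adding -train_from only if a checkpoint exists."""
    cmd = ["onmt_train", "-config", str(config)]
    if checkpoint is not None:
        cmd += ["-train_from", str(checkpoint)]
    return cmd


# Hypothetical paths, for illustration only.
print(" ".join(resume_command("wmt14_en_de.yaml", latest_checkpoint("run", "model"))))
```

The returned list could be passed to `subprocess.run` in a notebook cell; with no checkpoint present, the command simply starts training from scratch.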

chopinml · April 15, 2021, 12:32pm

Thank you so much for the information. So that’s not a full pause / resume option, then.

It would be great if this functionality could be added. Maybe, along with the .pt file, the state could also be serialized into a txt file or similar.

Each day we could download the .pt file from Colab to our machine, upload it to Dropbox, wget it from Google Colab the next day, run for 3-5 hours, save, and continue the following day. This would be a very cost-effective way to use the library.

Please consider this.

francoishernandez · April 15, 2021, 2:49pm

It’s not that easy; see OpenNMT/OpenNMT-py issue #2006 on GitHub, "How to reproduce the same training process when using "train_from"".

chopinml · April 15, 2021, 3:09pm

Oh, I see: you have tried this in the past but did not get the same results. It would be very good if it were possible, but anyway, thanks again for the great tool.