Hello again,
Sadly I don’t have a decent GPU, so I’m using Google Colab; my account is usually assigned a Tesla T4, which is nice.
The GPU is quite powerful, and I’m following the WMT14 translation example from here:
https://opennmt.net/OpenNMT-py/examples/Translation.html
I started training; here are the console outputs:
[2021-04-14 20:53:44,478 INFO] Step 100/100000; acc: 3.57; ppl: 6607.30; xent: 8.80; lr: 0.00001; 5557/5715 tok/s; 158 sec
[2021-04-14 20:56:27,445 INFO] Step 200/100000; acc: 4.72; ppl: 3428.49; xent: 8.14; lr: 0.00002; 5200/5498 tok/s; 321 sec
[2021-04-14 20:59:10,803 INFO] Step 300/100000; acc: 7.41; ppl: 1355.33; xent: 7.21; lr: 0.00004; 5297/5468 tok/s; 484 sec
[2021-04-14 21:01:54,482 INFO] Step 400/100000; acc: 9.19; ppl: 703.58; xent: 6.56; lr: 0.00005; 5304/5522 tok/s; 648 sec
[2021-04-14 21:04:40,492 INFO] Step 500/100000; acc: 10.63; ppl: 511.33; xent: 6.24; lr: 0.00006; 5392/5460 tok/s; 814 sec
[2021-04-14 21:07:23,062 INFO] Step 600/100000; acc: 11.35; ppl: 401.52; xent: 6.00; lr: 0.00007; 5215/5471 tok/s; 977 sec
[2021-04-14 21:10:07,182 INFO] Step 700/100000; acc: 12.96; ppl: 324.71; xent: 5.78; lr: 0.00009; 5315/5452 tok/s; 1141 sec
[2021-04-14 21:12:52,057 INFO] Step 800/100000; acc: 13.88; ppl: 272.98; xent: 5.61; lr: 0.00010; 5323/5495 tok/s; 1306 sec
[2021-04-14 21:15:37,431 INFO] Step 900/100000; acc: 14.66; ppl: 238.75; xent: 5.48; lr: 0.00011; 5232/5496 tok/s; 1471 sec
[2021-04-14 21:18:20,960 INFO] Step 1000/100000; acc: 15.99; ppl: 203.52; xent: 5.32; lr: 0.00012; 5246/5467 tok/s; 1634 sec
[2021-04-14 21:21:01,671 INFO] Step 1100/100000; acc: 17.06; ppl: 179.93; xent: 5.19; lr: 0.00014; 5367/5566 tok/s; 1795 sec
[2021-04-14 21:23:44,878 INFO] Step 1200/100000; acc: 17.93; ppl: 160.77; xent: 5.08; lr: 0.00015; 5234/5434 tok/s; 1958 sec
But the problem is that, by a rough calculation, 100,000 steps will take about 2 days (~160 seconds per 100 steps → ~160,000 seconds ≈ 44.5 hours).
Is there any option to continue from a saved checkpoint? For example, if I train 15k steps today, can I continue from the last saved point later? I have seen the train_from parameter and searched the docs and forum, but sadly I couldn’t find a recent example.
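From what I could piece together, I guess it would look something like the sketch below, but I’m not sure this is right. The save_model / save_checkpoint_steps / train_from keys are real OpenNMT-py options; the exact paths and the 15000-step checkpoint are just placeholders from my Colab setup, not from the tutorial:

```yaml
# Checkpoints get written as <save_model>_step_<N>.pt every save_checkpoint_steps steps
save_model: wmt14_en_de/run/model        # placeholder path from my setup
save_checkpoint_steps: 5000
train_steps: 100000

# My guess for resuming: point train_from at the last checkpoint that was saved,
# e.g. the one written at step 15000:
train_from: wmt14_en_de/run/model_step_15000.pt
```

Then presumably I would just rerun onmt_train with the same config (or pass -train_from on the command line instead of putting it in the YAML) and it would pick up from that step?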
Thanks in advance