The epoch cannot finish when training from a checkpoint model twice

Well, actually, I found this problem while investigating the BRNN stalling problem.

I found that if I continue training from a checkpoint model twice (a different checkpoint each time), the epoch never ends, even after the iteration count exceeds the maximum.

.........
[09/27/17 13:34:02 INFO] Epoch 3 ; Iteration 76400/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 6994 ; Perplexity 23.82	
[09/27/17 13:34:12 INFO] Epoch 3 ; Iteration 76450/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7004 ; Perplexity 22.87	
[09/27/17 13:34:24 INFO] Epoch 3 ; Iteration 76500/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7046 ; Perplexity 26.03	
[09/27/17 13:34:35 INFO] Epoch 3 ; Iteration 76550/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7053 ; Perplexity 24.08	
[09/27/17 13:34:47 INFO] Epoch 3 ; Iteration 76600/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7044 ; Perplexity 23.67	
[09/27/17 13:34:57 INFO] Epoch 3 ; Iteration 76650/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 6786 ; Perplexity 22.38	
[09/27/17 13:35:09 INFO] Epoch 3 ; Iteration 76700/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 6762 ; Perplexity 24.63	
[09/27/17 13:35:21 INFO] Epoch 3 ; Iteration 76750/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7448 ; Perplexity 26.97	
[09/27/17 13:35:32 INFO] Epoch 3 ; Iteration 76800/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 6960 ; Perplexity 23.75	
[09/27/17 13:35:43 INFO] Epoch 3 ; Iteration 76850/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7113 ; Perplexity 23.02	
[09/27/17 13:35:55 INFO] Epoch 3 ; Iteration 76900/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7298 ; Perplexity 27.34	
[09/27/17 13:36:07 INFO] Epoch 3 ; Iteration 76950/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7105 ; Perplexity 26.38	
[09/27/17 13:36:19 INFO] Epoch 3 ; Iteration 77000/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7458 ; Perplexity 25.72	
[09/27/17 13:36:30 INFO] Epoch 3 ; Iteration 77050/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 6877 ; Perplexity 24.18	
[09/27/17 13:36:43 INFO] Epoch 3 ; Iteration 77100/76803 ; Optim SGD LR 1.000000 ; Source tokens/s 7407 ; Perplexity 27.12
.........
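The log above shows the iteration counter sailing past the epoch maximum (77100/76803) without the epoch ending. As a purely illustrative sketch (not OpenNMT's actual code), here is one way such a symptom can arise: if resuming from a checkpoint restores an iteration counter that is misaligned with the epoch boundary, an exact-equality end-of-epoch check never fires and the counter overshoots. The function name and numbers below are hypothetical, chosen only to mirror the log.

```python
# Illustrative sketch only -- NOT OpenNMT's actual training loop.
# Shows how an exact-equality epoch-end check can be skipped when a
# resumed iteration counter is misaligned with the epoch boundary.

def run_epoch(start_iter, max_iter, step=50, safety_cap=200):
    """Advance the counter by `step`; end the epoch only on equality."""
    it = start_iter
    for _ in range(safety_cap):      # cap so the sketch always terminates
        it += step
        if it == max_iter:           # fragile: '>=' would be robust here
            return it, True          # epoch ended normally
    return it, False                 # counter overshot; epoch never ended

# Counter aligned with the boundary: the equality check fires.
final, ended = run_epoch(start_iter=76750, max_iter=76800)

# Counter misaligned (e.g. restored from a checkpoint with an offset):
# the counter jumps over max_iter and keeps going, as in the log.
final2, ended2 = run_epoch(start_iter=76750, max_iter=76803)
```

With a `>=` check instead of `==`, the second run would stop at the first iteration past the boundary; that robustness is the point of the sketch, not a claim about where OpenNMT's bug actually lives.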

Could you also share (if possible) all the command lines you ran?

If this is easily reproducible, please open an issue on GitHub.

OK, I will try to reproduce it.

Uh, it seems that this problem is not easy to reproduce.

I tried to reproduce it with the OpenNMT demo data.

Using CPU only, it works fine. Using one GPU, I cancelled and continued through all 13 epochs again and again; nothing abnormal happened.

I guess it only emerges when using two GPUs, but the two-GPU machine is currently training a model. I will try again when that machine is available.

Thanks @huache - in the meantime, please provide at least the command line you are using for this 2-GPU mode so that we can investigate.

Thank you for your attention. I have reproduced the problem on a two-GPU machine and, following the suggestion of @guillaumekln, opened an issue on GitHub: Iteration number exceed the max iteration number when continuing a two GPU training

I listed all the commands and steps needed to reproduce the problem in the issue. I hope it helps.

If you need any more information about this problem, just @ me on this post or the issue. I’ll try my best to provide it.

@guillaumekln @jean.senellart
The problem has already been solved. You guys are so efficient and helpful; I really appreciate it!