Best model overwritten - Shall I use the latest model?

I have trained a few multilingual models and would now like to run inference with them. In the training configuration file, I set early_stopping: 4. Since I use different configurations, some of the training setups finish earlier than the maximum number of training steps (120000).

Now I am not sure whether I should use the latest checkpoint or the one that the log indicates as the best model. For instance, one of the configurations is trained until step 100000, while the best model is reported to be the one at step 30000.

Another problem is that the checkpoints are overwritten and only the last 10 are kept, so I no longer have the best model saved! I also notice that the training accuracy drops to zero after the step of the best model.

Here is what my training log file looks like:


[2024-06-02 23:09:46,304 INFO] Step 30000/120000; acc: 45.4; ppl:  53.4; xent: 4.0; lr: 0.00051; sents:   67752; bsz: 3720/1414/85; 58818/22354 tok/s;  17273 sec;
[2024-06-02 23:10:00,329 INFO] valid stats calculation
                           took: 14.022910594940186 s.
[2024-06-02 23:10:00,330 INFO] Train perplexity: 78.3356
[2024-06-02 23:10:00,330 INFO] Train accuracy: 38.6684
[2024-06-02 23:10:00,330 INFO] Sentences processed: 3.18657e+07
[2024-06-02 23:10:00,330 INFO] Average bsz: 3454/2389/133
[2024-06-02 23:10:00,330 INFO] Validation perplexity: 95.0617
[2024-06-02 23:10:00,330 INFO] Validation accuracy: 37.4023
[2024-06-02 23:10:00,330 INFO] Model is improving ppl: 101.871 --> 95.0617.
[2024-06-02 23:10:00,330 INFO] Model is improving acc: 36.3971 --> 37.4023.
[2024-06-02 23:10:00,357 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_30000.pt
[2024-06-02 23:10:56,304 INFO] Step 30100/120000; acc: 45.6; ppl:  53.0; xent: 4.0; lr: 0.00051; sents:   66798; bsz: 3751/1392/83; 42872/15912 tok/s;  17343 sec;
[2024-06-02 23:11:49,239 INFO] Step 30200/120000; acc: 43.7; ppl:  57.9; xent: 4.1; lr: 0.00051; sents:   94506; bsz: 3506/1756/118; 52988/26538 tok/s;  17396 sec;
[2024-06-02 23:12:44,659 INFO] Step 30300/120000; acc: 43.3; ppl:  59.2; xent: 4.1; lr: 0.00051; sents:  109940; bsz: 3302/2057/137; 47665/29691 tok/s;  17451 sec;
[2024-06-02 23:13:41,752 INFO] Step 30400/120000; acc: 43.9; ppl:  57.5; xent: 4.1; lr: 0.00051; sents:  111925; bsz: 3288/2077/140; 46070/29097 tok/s;  17508 sec;
[2024-06-02 23:14:42,216 INFO] Step 30500/120000; acc: 41.4; ppl:  64.5; xent: 4.2; lr: 0.00051; sents:  110169; bsz: 3461/2269/138; 45799/30019 tok/s;  17569 sec;
[2024-06-02 23:15:44,086 INFO] Step 30600/120000; acc: 41.5; ppl:  64.1; xent: 4.2; lr: 0.00051; sents:  102924; bsz: 3502/2168/129; 45289/28032 tok/s;  17631 sec;
[2024-06-02 23:16:44,481 INFO] Step 30700/120000; acc: 41.6; ppl:  63.8; xent: 4.2; lr: 0.00050; sents:  109754; bsz: 3429/2241/137; 45420/29685 tok/s;  17691 sec;
[2024-06-02 23:17:43,541 INFO] Step 30800/120000; acc: 41.6; ppl:  64.0; xent: 4.2; lr: 0.00050; sents:  115014; bsz: 3467/2311/144; 46958/31299 tok/s;  17750 sec;
[2024-06-02 23:18:43,727 INFO] Step 30900/120000; acc: 41.7; ppl:  62.9; xent: 4.1; lr: 0.00050; sents:  106558; bsz: 3536/2190/133; 47004/29109 tok/s;  17810 sec;
[2024-06-02 23:19:42,977 INFO] Step 31000/120000; acc: 42.7; ppl:  54.6; xent: 4.0; lr: 0.00050; sents:  129455; bsz: 3034/2690/162; 40973/36319 tok/s;  17870 sec;
[2024-06-02 23:19:42,996 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_31000.pt
[2024-06-02 23:20:56,164 INFO] Step 31100/120000; acc: 43.1; ppl:  53.4; xent: 4.0; lr: 0.00050; sents:  134543; bsz: 3162/2686/168; 34559/29359 tok/s;  17943 sec;
[2024-06-02 23:21:53,057 INFO] Step 31200/120000; acc: 43.5; ppl:  52.1; xent: 4.0; lr: 0.00050; sents:  132989; bsz: 3135/2652/166; 44082/37295 tok/s;  18000 sec;
[2024-06-02 23:22:50,418 INFO] Step 31300/120000; acc: 43.3; ppl:  53.2; xent: 4.0; lr: 0.00050; sents:  135378; bsz: 3186/2660/169; 44437/37102 tok/s;  18057 sec;
[2024-06-02 23:23:49,673 INFO] Step 31400/120000; acc: 43.5; ppl:  53.1; xent: 4.0; lr: 0.00050; sents:  140976; bsz: 3218/2753/176; 43447/37165 tok/s;  18116 sec;
[2024-06-02 23:24:49,875 INFO] Step 31500/120000; acc: 44.0; ppl:  51.7; xent: 3.9; lr: 0.00050; sents:  137362; bsz: 3304/2611/172; 43905/34703 tok/s;  18177 sec;
[2024-06-02 23:25:41,409 INFO] Step 31600/120000; acc: 44.2; ppl:  51.3; xent: 3.9; lr: 0.00050; sents:  138007; bsz: 3327/2603/173; 51652/40402 tok/s;  18228 sec;
[2024-06-02 23:26:38,363 INFO] Step 31700/120000; acc: 42.5; ppl:  57.8; xent: 4.1; lr: 0.00050; sents:  112080; bsz: 3528/2511/140; 49559/35275 tok/s;  18285 sec;
[2024-06-02 23:27:36,681 INFO] Step 31800/120000; acc: 41.0; ppl:  64.1; xent: 4.2; lr: 0.00050; sents:   88098; bsz: 3774/2538/110; 51778/34817 tok/s;  18343 sec;
[2024-06-02 23:28:29,901 INFO] Step 31900/120000; acc: 41.3; ppl:  62.9; xent: 4.1; lr: 0.00049; sents:   86648; bsz: 3743/2465/108; 56271/37049 tok/s;  18397 sec;
[2024-06-02 23:29:21,121 INFO] Step 32000/120000; acc: 41.5; ppl:  62.1; xent: 4.1; lr: 0.00049; sents:   85069; bsz: 3736/2508/106; 58355/39168 tok/s;  18448 sec;
[2024-06-02 23:29:21,135 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_32000.pt
[2024-06-02 23:30:17,114 INFO] Step 32100/120000; acc: 41.5; ppl:  62.2; xent: 4.1; lr: 0.00049; sents:   85873; bsz: 3748/2509/107; 53544/35843 tok/s;  18504 sec;
[2024-06-02 23:31:11,881 INFO] Step 32200/120000; acc: 41.6; ppl:  61.7; xent: 4.1; lr: 0.00049; sents:   87269; bsz: 3828/2466/109; 55914/36029 tok/s;  18559 sec;
[2024-06-02 23:32:10,114 INFO] Step 32300/120000; acc: 40.3; ppl:  66.4; xent: 4.2; lr: 0.00049; sents:   96468; bsz: 3542/2807/121; 48657/38564 tok/s;  18617 sec;
[2024-06-02 23:33:07,227 INFO] Step 32400/120000; acc: 39.1; ppl:  71.1; xent: 4.3; lr: 0.00049; sents:  116882; bsz: 3313/3207/146; 46400/44922 tok/s;  18674 sec;
[2024-06-02 23:34:00,624 INFO] Step 32500/120000; acc: 39.2; ppl:  70.7; xent: 4.3; lr: 0.00049; sents:  118302; bsz: 3222/3190/148; 48280/47796 tok/s;  18727 sec;
[2024-06-02 23:34:54,105 INFO] Step 32600/120000; acc: 39.3; ppl:  69.8; xent: 4.2; lr: 0.00049; sents:  114208; bsz: 3196/3173/143; 47808/47468 tok/s;  18781 sec;
[2024-06-02 23:35:50,364 INFO] Step 32700/120000; acc: 39.6; ppl:  68.8; xent: 4.2; lr: 0.00049; sents:  110583; bsz: 3216/3030/138; 45732/43091 tok/s;  18837 sec;
[2024-06-02 23:36:57,808 INFO] Step 32800/120000; acc: 44.5; ppl:  55.3; xent: 4.0; lr: 0.00049; sents:   72207; bsz: 3706/1491/90; 43955/17682 tok/s;  18904 sec;
[2024-06-02 23:37:51,862 INFO] Step 32900/120000; acc: 46.0; ppl:  51.6; xent: 3.9; lr: 0.00049; sents:   66673; bsz: 3762/1290/83; 55704/19105 tok/s;  18958 sec;
[2024-06-02 23:38:51,286 INFO] Step 33000/120000; acc: 46.4; ppl:  50.6; xent: 3.9; lr: 0.00049; sents:   68687; bsz: 3790/1311/86; 51022/17654 tok/s;  19018 sec;
[2024-06-02 23:38:51,303 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_33000.pt
[2024-06-02 23:39:59,228 INFO] Step 33100/120000; acc: 46.2; ppl:  51.2; xent: 3.9; lr: 0.00049; sents:   67034; bsz: 3745/1348/84; 44097/15874 tok/s;  19086 sec;
[2024-06-02 23:40:53,964 INFO] Step 33200/120000; acc: 43.6; ppl:  57.4; xent: 4.1; lr: 0.00049; sents:  106660; bsz: 3350/1968/133; 48966/28763 tok/s;  19141 sec;
[2024-06-02 23:41:45,030 INFO] Step 33300/120000; acc: 43.6; ppl:  57.4; xent: 4.1; lr: 0.00048; sents:  112443; bsz: 3285/2144/141; 51464/33585 tok/s;  19192 sec;
[2024-06-02 23:42:40,495 INFO] Step 33400/120000; acc: 43.6; ppl:  57.4; xent: 4.1; lr: 0.00048; sents:  109879; bsz: 3356/2136/137; 48412/30806 tok/s;  19247 sec;
[2024-06-02 23:43:36,798 INFO] Step 33500/120000; acc: 41.6; ppl:  63.1; xent: 4.1; lr: 0.00048; sents:  110681; bsz: 3409/2333/138; 48434/33150 tok/s;  19303 sec;
[2024-06-02 23:44:28,596 INFO] Step 33600/120000; acc: 41.9; ppl:  62.1; xent: 4.1; lr: 0.00048; sents:  106083; bsz: 3451/2251/133; 53305/34761 tok/s;  19355 sec;
[2024-06-02 23:45:22,242 INFO] Step 33700/120000; acc: 41.9; ppl:  62.4; xent: 4.1; lr: 0.00048; sents:  112130; bsz: 3519/2259/140; 52481/33685 tok/s;  19409 sec;
[2024-06-02 23:46:16,541 INFO] Step 33800/120000; acc: 42.1; ppl:  61.9; xent: 4.1; lr: 0.00048; sents:  108336; bsz: 3524/2225/135; 51919/32787 tok/s;  19463 sec;
[2024-06-02 23:47:19,027 INFO] Step 33900/120000; acc: 42.4; ppl:  59.0; xent: 4.1; lr: 0.00048; sents:  114068; bsz: 3303/2381/143; 42292/30478 tok/s;  19526 sec;
[2024-06-02 23:48:20,824 INFO] Step 34000/120000; acc: 43.0; ppl:  53.8; xent: 4.0; lr: 0.00048; sents:  135863; bsz: 3085/2753/170; 39943/35641 tok/s;  19587 sec;
[2024-06-02 23:48:20,843 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_34000.pt
[2024-06-02 23:49:23,899 INFO] Step 34100/120000; acc: 43.3; ppl:  52.8; xent: 4.0; lr: 0.00048; sents:  130695; bsz: 3085/2660/163; 39128/33734 tok/s;  19651 sec;
[2024-06-02 23:50:18,116 INFO] Step 34200/120000; acc: 43.2; ppl:  53.1; xent: 4.0; lr: 0.00048; sents:  135579; bsz: 3222/2640/169; 47540/38959 tok/s;  19705 sec;
[2024-06-02 23:51:17,280 INFO] Step 34300/120000; acc: 43.5; ppl:  52.5; xent: 4.0; lr: 0.00048; sents:  139587; bsz: 3211/2731/174; 43417/36925 tok/s;  19764 sec;
[2024-06-02 23:52:16,499 INFO] Step 34400/120000; acc: 44.4; ppl:  50.4; xent: 3.9; lr: 0.00048; sents:  133697; bsz: 3381/2595/167; 45682/35058 tok/s;  19823 sec;
[2024-06-02 23:53:13,790 INFO] Step 34500/120000; acc: 44.4; ppl:  50.7; xent: 3.9; lr: 0.00048; sents:  139098; bsz: 3331/2607/174; 46520/36400 tok/s;  19880 sec;
[2024-06-02 23:54:08,041 INFO] Step 34600/120000; acc: 44.9; ppl:  49.3; xent: 3.9; lr: 0.00048; sents:  133750; bsz: 3297/2602/167; 48617/38368 tok/s;  19935 sec;
[2024-06-02 23:55:08,959 INFO] Step 34700/120000; acc: 42.0; ppl:  59.6; xent: 4.1; lr: 0.00047; sents:  102109; bsz: 3602/2601/128; 47308/34159 tok/s;  19996 sec;
[2024-06-02 23:56:04,503 INFO] Step 34800/120000; acc: 40.6; ppl:  65.6; xent: 4.2; lr: 0.00047; sents:   88224; bsz: 3691/2549/110; 53156/36709 tok/s;  20051 sec;
[2024-06-02 23:56:54,845 INFO] Step 34900/120000; acc: 40.7; ppl:  65.2; xent: 4.2; lr: 0.00047; sents:   86754; bsz: 3736/2474/108; 59367/39312 tok/s;  20101 sec;
[2024-06-02 23:57:51,480 INFO] Step 35000/120000; acc: 40.9; ppl:  64.3; xent: 4.2; lr: 0.00047; sents:   86005; bsz: 3791/2503/108; 53550/35361 tok/s;  20158 sec;
[2024-06-02 23:57:51,495 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_35000.pt
[2024-06-02 23:59:09,760 INFO] Step 35100/120000; acc: 41.0; ppl:  64.0; xent: 4.2; lr: 0.00047; sents:   90337; bsz: 3749/2571/113; 38315/26275 tok/s;  20236 sec;
[2024-06-03 00:00:09,909 INFO] Step 35200/120000; acc: 41.1; ppl:  63.5; xent: 4.2; lr: 0.00047; sents:   88527; bsz: 3732/2553/111; 49634/33956 tok/s;  20297 sec;
[2024-06-03 00:01:07,756 INFO] Step 35300/120000; acc: 39.9; ppl:  67.9; xent: 4.2; lr: 0.00047; sents:  102636; bsz: 3421/2908/128; 47307/40213 tok/s;  20354 sec;
[2024-06-03 00:01:59,101 INFO] Step 35400/120000; acc: 39.7; ppl:  68.7; xent: 4.2; lr: 0.00047; sents:  109714; bsz: 3278/2978/137; 51072/46396 tok/s;  20406 sec;
[2024-06-03 00:02:50,051 INFO] Step 35500/120000; acc: 39.7; ppl:  68.7; xent: 4.2; lr: 0.00047; sents:  111142; bsz: 3232/3047/139; 50756/47840 tok/s;  20457 sec;
[2024-06-03 00:03:42,426 INFO] Step 35600/120000; acc: 39.9; ppl:  67.8; xent: 4.2; lr: 0.00047; sents:  109913; bsz: 3291/3010/137; 50265/45970 tok/s;  20509 sec;
[2024-06-03 00:04:37,922 INFO] Step 35700/120000; acc: 40.6; ppl:  66.1; xent: 4.2; lr: 0.00047; sents:  102864; bsz: 3357/2679/129; 48396/38625 tok/s;  20565 sec;
[2024-06-03 00:05:36,283 INFO] Step 35800/120000; acc: 45.5; ppl:  52.8; xent: 4.0; lr: 0.00047; sents:   71478; bsz: 3672/1451/89; 50340/19886 tok/s;  20623 sec;
[2024-06-03 00:06:26,954 INFO] Step 35900/120000; acc: 45.8; ppl:  52.2; xent: 4.0; lr: 0.00047; sents:   71261; bsz: 3750/1386/89; 59199/21879 tok/s;  20674 sec;
[2024-06-03 00:07:23,087 INFO] Step 36000/120000; acc: 46.2; ppl:  50.7; xent: 3.9; lr: 0.00047; sents:   69777; bsz: 3742/1304/87; 53326/18580 tok/s;  20730 sec;
[2024-06-03 00:07:23,107 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_36000.pt
[2024-06-03 00:08:21,558 INFO] Step 36100/120000; acc: 45.5; ppl:  52.2; xent: 4.0; lr: 0.00047; sents:   78986; bsz: 3680/1503/99; 50352/20566 tok/s;  20788 sec;
[2024-06-03 00:09:21,667 INFO] Step 36200/120000; acc: 43.4; ppl:  57.4; xent: 4.0; lr: 0.00046; sents:  107472; bsz: 3324/2093/134; 44234/27855 tok/s;  20848 sec;
[2024-06-03 00:10:21,829 INFO] Step 36300/120000; acc: 43.3; ppl:  58.0; xent: 4.1; lr: 0.00046; sents:  114245; bsz: 3352/2164/143; 44579/28773 tok/s;  20908 sec;
[2024-06-03 00:11:15,834 INFO] Step 36400/120000; acc: 43.9; ppl:  56.0; xent: 4.0; lr: 0.00046; sents:  106349; bsz: 3388/2099/133; 50193/31089 tok/s;  20962 sec;
[2024-06-03 00:12:09,717 INFO] Step 36500/120000; acc: 41.9; ppl:  61.6; xent: 4.1; lr: 0.00046; sents:  109859; bsz: 3386/2287/137; 50275/33962 tok/s;  21016 sec;
[2024-06-03 00:13:01,108 INFO] Step 36600/120000; acc: 42.3; ppl:  60.3; xent: 4.1; lr: 0.00046; sents:  105917; bsz: 3465/2237/132; 53943/34824 tok/s;  21068 sec;
[2024-06-03 00:13:53,025 INFO] Step 36700/120000; acc: 42.2; ppl:  60.6; xent: 4.1; lr: 0.00046; sents:  114151; bsz: 3398/2359/143; 52368/36352 tok/s;  21120 sec;
[2024-06-03 00:14:44,540 INFO] Step 36800/120000; acc: 42.3; ppl:  60.4; xent: 4.1; lr: 0.00046; sents:  115448; bsz: 3433/2339/144; 53306/36322 tok/s;  21171 sec;
[2024-06-03 00:15:36,142 INFO] Step 36900/120000; acc: 42.5; ppl:  58.7; xent: 4.1; lr: 0.00046; sents:  115005; bsz: 3425/2360/144; 53093/36581 tok/s;  21223 sec;
[2024-06-03 00:16:28,265 INFO] Step 37000/120000; acc: 43.0; ppl:   nan; xent: nan; lr: 0.00046; sents:  133408; bsz: 3133/2690/167; 48091/41290 tok/s;  21275 sec;
[2024-06-03 00:16:28,279 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_37000.pt
[2024-06-03 00:17:43,819 INFO] Step 37100/120000; acc: 41.4; ppl:   nan; xent: nan; lr: 0.00046; sents:  135842; bsz: 3154/2693/170; 33398/28519 tok/s;  21350 sec;
[2024-06-03 00:18:36,918 INFO] Step 37200/120000; acc: 39.3; ppl:   nan; xent: nan; lr: 0.00046; sents:  132358; bsz: 3136/2691/165; 47246/40544 tok/s;  21404 sec;
[2024-06-03 00:19:34,950 INFO] Step 37300/120000; acc: 39.1; ppl:   nan; xent: nan; lr: 0.00046; sents:  136628; bsz: 3232/2667/171; 44562/36767 tok/s;  21462 sec;
[2024-06-03 00:20:36,457 INFO] Step 37400/120000; acc: 39.1; ppl:   nan; xent: nan; lr: 0.00046; sents:  125389; bsz: 3367/2618/157; 43791/34052 tok/s;  21523 sec;
[2024-06-03 00:21:38,142 INFO] Step 37500/120000; acc: 38.9; ppl:   nan; xent: nan; lr: 0.00046; sents:  127619; bsz: 3415/2540/160; 44294/32942 tok/s;  21585 sec;
[2024-06-03 00:22:30,864 INFO] Step 37600/120000; acc: 39.0; ppl:   nan; xent: nan; lr: 0.00046; sents:  139814; bsz: 3376/2675/175; 51235/40589 tok/s;  21637 sec;
[2024-06-03 00:23:28,395 INFO] Step 37700/120000; acc: 13.8; ppl:   nan; xent: nan; lr: 0.00046; sents:  107044; bsz: 3623/2501/134; 50374/34779 tok/s;  21695 sec;
[2024-06-03 00:24:18,345 INFO] Step 37800/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   87043; bsz: 3692/2623/109; 59128/42017 tok/s;  21745 sec;
[2024-06-03 00:25:10,011 INFO] Step 37900/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   91293; bsz: 3638/2595/114; 56336/40177 tok/s;  21797 sec;
[2024-06-03 00:26:10,770 INFO] Step 38000/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   90784; bsz: 3668/2643/113; 48291/34802 tok/s;  21857 sec;
[2024-06-03 00:26:10,787 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_38000.pt
[2024-06-03 00:27:49,933 INFO] Step 38100/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   88325; bsz: 3753/2535/110; 30274/20453 tok/s;  21957 sec;
[2024-06-03 00:28:40,260 INFO] Step 38200/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   91597; bsz: 3719/2603/114; 59111/41380 tok/s;  22007 sec;
[2024-06-03 00:29:37,968 INFO] Step 38300/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   99621; bsz: 3399/2756/125; 47127/38206 tok/s;  22065 sec;
[2024-06-03 00:30:33,052 INFO] Step 38400/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:  107624; bsz: 3292/2936/135; 47805/42645 tok/s;  22120 sec;
[2024-06-03 00:31:27,930 INFO] Step 38500/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:  107520; bsz: 3362/2886/134; 49013/42067 tok/s;  22175 sec;
[2024-06-03 00:32:22,107 INFO] Step 38600/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:  106059; bsz: 3347/2845/133; 49417/42010 tok/s;  22229 sec;
[2024-06-03 00:33:19,386 INFO] Step 38700/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   99210; bsz: 3386/2539/124; 47289/35457 tok/s;  22286 sec;
[2024-06-03 00:34:12,809 INFO] Step 38800/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   75669; bsz: 3663/1476/95; 54850/22110 tok/s;  22339 sec;
[2024-06-03 00:35:05,837 INFO] Step 38900/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   79536; bsz: 3694/1516/99; 55724/22869 tok/s;  22392 sec;
[2024-06-03 00:35:55,154 INFO] Step 39000/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   70545; bsz: 3715/1407/88; 60263/22822 tok/s;  22442 sec;
[2024-06-03 00:35:55,169 INFO] Saving checkpoint /home/user/ahmadi/NMT/NMT/models/config_2_uniform/model_step_39000.pt
[2024-06-03 00:36:51,486 INFO] Step 39100/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:   84403; bsz: 3641/1556/106; 51714/22101 tok/s;  22498 sec;
[2024-06-03 00:37:43,863 INFO] Step 39200/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:  108999; bsz: 3315/2186/136; 50639/33384 tok/s;  22550 sec;
[2024-06-03 00:38:33,750 INFO] Step 39300/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:  107820; bsz: 3366/2068/135; 53980/33170 tok/s;  22600 sec;
[2024-06-03 00:39:23,466 INFO] Step 39400/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00045; sents:  104178; bsz: 3408/2087/130; 54834/33588 tok/s;  22650 sec;
[2024-06-03 00:40:21,522 INFO] Step 39500/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00044; sents:  109186; bsz: 3501/2258/136; 48248/31110 tok/s;  22708 sec;
[2024-06-03 00:41:22,401 INFO] Step 39600/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00044; sents:  112255; bsz: 3425/2370/140; 45014/31150 tok/s;  22769 sec;
[2024-06-03 00:42:19,883 INFO] Step 39700/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00044; sents:  114401; bsz: 3363/2372/143; 46812/33010 tok/s;  22827 sec;
[2024-06-03 00:43:11,747 INFO] Step 39800/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00044; sents:  112834; bsz: 3411/2295/141; 52619/35396 tok/s;  22878 sec;
[2024-06-03 00:44:11,802 INFO] Step 39900/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00044; sents:  118180; bsz: 3326/2415/148; 44304/32177 tok/s;  22938 sec;
[2024-06-03 00:45:12,702 INFO] Step 40000/120000; acc: 0.0; ppl:   nan; xent: nan; lr: 0.00044; sents:  137111; bsz: 3177/2701/171; 41733/35477 tok/s;  22999 sec;
[2024-06-03 00:45:26,459 INFO] valid stats calculation
                           took: 13.754674196243286 s.

Thanks for helping out.

Hello!

Let’s first review the situation:

  • Early stopping is defined by OpenNMT-py as ‘number of validation steps without improving.’
  • From the log, it appears that you save a checkpoint every 1,000 training steps, but run validation only every 10,000 steps (i.e. 10 checkpoints per validation).
  • Your checkpoint file names contain the training step number (e.g. model_step_30000.pt).
  • Only the last 10 checkpoints are kept. You say the best model is “model_step_30000.pt”, and you cannot find it anymore because it has been overwritten/deleted (a small log-parsing sketch follows right below).
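
Incidentally, since the log prints “Model is improving ppl” after every validation that improves, you can still recover the best step from the log even when the checkpoint itself is gone. A rough Python sketch, assuming the log format shown above (the log path is a placeholder):

# Recover the last improving validation step from an OpenNMT-py training log.
# This relies on the "Step N/..." and "Model is improving ppl" lines shown above.
import re

best_step, last_step = None, None
with open("train.log", encoding="utf-8") as f:  # path to your training log
    for line in f:
        m = re.search(r"Step (\d+)/", line)
        if m:
            last_step = int(m.group(1))  # step of the most recent report
        if "Model is improving ppl" in line and last_step is not None:
            best_step = last_step  # validation at this step improved
print("last improving validation step:", best_step)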

There are a few points here:

  • You probably have a test set. Consider evaluating these checkpoints on the test set with common metrics such as BLEU, ChrF++, and COMET.
  • If the checkpoints with 0 accuracy turn out to perform badly, exclude them.
  • Average all the checkpoints that do not have 0 accuracy and evaluate the resulting model (a hand-rolled averaging sketch follows this list).
  • Average the best-performing checkpoints and evaluate the resulting model.
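
OpenNMT-py also has a model-averaging script, if I remember correctly, but a minimal hand-rolled sketch with PyTorch could look like the following. The step numbers, paths, and the checkpoint layout (a dict holding “model” and “generator” state dicts) are assumptions; adjust them to your runs:

# Hand-rolled checkpoint averaging (a sketch, not OpenNMT-py's own tool).
# Step numbers, paths, and the checkpoint keys ("model", "generator") are assumptions.
import torch

steps = [33000, 34000, 35000, 36000]
paths = [f"models/config_2_uniform/model_step_{s}.pt" for s in steps]

avg = {"model": {}, "generator": {}}
template = None
for p in paths:
    ckpt = torch.load(p, map_location="cpu")
    template = ckpt  # reuse the last checkpoint as a template for saving
    for key in avg:
        for name, tensor in ckpt[key].items():
            avg[key][name] = avg[key].get(name, 0) + tensor.float() / len(paths)

for key in avg:
    template[key] = avg[key]
torch.save(template, "models/config_2_uniform/model_step_avg.pt")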

All the best,
Yasmin

Dear Yasmin,

Thanks for your insightful feedback and the accurate description of the problem.

I indeed tested a few of my models at different checkpoints (84 training setups overall), and all I get in the output is <blank>. So I think there is something wrong with the whole training. My training corpus contains over 3M parallel sentences. I know that is not a huge amount, but I would at least expect the model to learn something, which does not seem to be the case.

I might also add that I am training a multilingual model from 6 languages into English. I do not specify the language of each sentence (as in prepending a language token in <> to the source sentences; a small sketch of what I mean is below), and my training data is a mix of sentences in all those languages. Do you think this causes such a weird result?
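
To be concrete, by adding the language code I mean something like the following sketch (the file names and the <xx> tag are placeholders, not my actual setup):

# Prepend a language token to every line of one language's source file.
# File names and the tag are placeholders for illustration only.
lang = "xx"  # language code of this source file

with open("train.src.xx.txt", encoding="utf-8") as fin, \
     open("train.src.xx.tagged.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(f"<{lang}> {line}")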

I appreciate your comments.

3M parallel sentences should still give you something. Consider revising the configuration and the tokenization (e.g. SentencePiece).
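
For instance, a minimal SentencePiece setup might look like this; the file names, vocabulary size, and model type below are placeholders rather than recommendations:

# Train a joint SentencePiece model and check its output (a sketch; adjust paths/sizes).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.src.txt,train.tgt.txt",  # raw training text, one sentence per line
    model_prefix="spm_joint_8k",
    vocab_size=8192,
    character_coverage=1.0,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_joint_8k.model")
print(sp.encode("This is a test sentence.", out_type=str))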

I do not know what your background with OpenNMT-py is. If this is your first time trying it, maybe start with the tutorial.

I see.

Do you know how I can save the best model? keep_checkpoint isn’t ideal in my case!

This is my configuration:

save_data: /home/user/anony/NMT/NMT/models/config_1
log_file: /home/user/anony/NMT/NMT/models/config_1/train.log
save_model: /home/user/anony/NMT/NMT/models/config_1/model

src_vocab: /home/user/anony/NMT/NMT/models/config_1/vocab.src
tgt_vocab: /home/user/anony/NMT/NMT/models/config_1/vocab.tgt

src_vocab_size: 8192
tgt_vocab_size: 8192

# Allow overwriting existing files in the folder
overwrite: True

# Corpus opts:
data:
    corpus_1:
        path_src: /home/user/anony/NMT/data/source_down_train_downscaled_GPT2_BERT_8192.txt
        path_tgt: /home/user/anony/NMT/data/target_down_train_downscaled_GPT2_BERT_8192.txt
    valid:
        path_src: /home/user/anony/NMT/data/source_down_val_downscaled_GPT2_BERT_8192.txt
        path_tgt: /home/user/anony/NMT/data/target_down_val_downscaled_GPT2_BERT_8192.txt

save_checkpoint_steps: 1000
keep_checkpoint: 10
seed: 3435
train_steps: 120000
valid_steps: 10000
warmup_steps: 8000
report_every: 100

decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
hidden_size: 512
layers: 6
transformer_ff: 2048
heads: 8

model_dtype: "fp16"
accum_count: 8
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0

batch_size: 4096
valid_batch_size: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'

world_size: 1
gpu_ranks: [0]

For future reference, the best model is indeed overwritten once it falls outside the window of the last keep_checkpoint checkpoints. So what I ended up doing is removing keep_checkpoint so that all checkpoints are saved, and once training is over, I delete every checkpoint except the best one (roughly as in the small cleanup sketch below).
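
Roughly, the cleanup I run after training looks like this (the best step is read off the training log by hand, and the paths are just my layout):

# Keep only the best checkpoint after training; delete everything else.
# The best step below is taken from the training log manually.
import glob
import os

best_step = 30000
model_dir = "/home/user/anony/NMT/NMT/models/config_1"

for path in glob.glob(os.path.join(model_dir, "model_step_*.pt")):
    if os.path.basename(path) != f"model_step_{best_step}.pt":
        os.remove(path)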

It would have been great to have an option in the configuration to take care of this.


I think there used to be a PR for saving the best model, but I cannot see it in the latest version of OpenNMT-py anymore. @francoishernandez