Determining Epochs

How can I determine the number of epochs?

Training loops over the preprocessed shards. You can check the logs to see when it loops back to shard 0. (If you're using the multiple-dataset iterator, though, this won't mean much, because an 'epoch' differs from one dataset to another.)

I have 249382 samples in my training data and 32000 for valid/test data,
and batch size = 512. Does it need 249382 / 512 steps for one epoch?
If yes … when does an epoch end?

You also need to take into account the number of GPUs you use and any gradient accumulation (-accum_count) to get your 'true' batch size.
E.g. with -batch_size 512, 2 GPUs, and -accum_count 3, one step will process 512 * 2 * 3 examples.
You can check it in the log: when it loads the data again, hence starting a new epoch, it will log 'Loading dataset …'. (This may be a bit off, because examples are loaded in advance to improve training speed and GPU utilization.)
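
To make the arithmetic concrete, here is a minimal Python sketch of that estimate, using the example numbers above (it assumes -batch_size counts sentences, which is the OpenNMT-py default batch_type):

# Rough steps-per-epoch estimate for OpenNMT-py,
# assuming -batch_size counts sentences (the default batch_type).
batch_size = 512       # -batch_size
n_gpus = 2             # -world_size
accum_count = 3        # -accum_count
dataset_size = 249382  # number of training examples

true_batch = batch_size * n_gpus * accum_count   # examples consumed per step
steps_per_epoch = dataset_size / true_batch
print(true_batch, round(steps_per_epoch, 1))     # 3072 81.2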


Ok, I have these args for my system:
!onmt_train -batch_size 512 \
-accum_count 3 \
-layers 1 \
-rnn_size 128 \
-data data/data \
-pre_word_vecs_enc "data/embeddings.enc.pt" \
-pre_word_vecs_dec "data/embeddings.dec.pt" \
-src_word_vec_size 224 \
-tgt_word_vec_size 336 \
-fix_word_vecs_enc \
-fix_word_vecs_dec \
-save_model data/model \
-save_checkpoint_steps 100 \
-train_steps 1000 \
-model_type text \
-encoder_type rnn \
-decoder_type rnn \
-rnn_type GRU \
-global_attention dot \
-global_attention_function softmax \
-early_stopping 10 \
-optim adam \
-learning_rate 0.001 \
-valid_steps 100 \
-dropout .2 \
-attention_dropout .3
and the log is:
[2019-10-11 13:47:04,223 INFO] * src vocab size = 75060
[2019-10-11 13:47:04,223 INFO] * tgt vocab size = 13146
[2019-10-11 13:47:04,223 INFO] Building model…
/usr/local/lib/python3.5/dist-packages/torch/nn/modules/rnn.py:51: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
“num_layers={}”.format(dropout, num_layers))
[2019-10-11 13:47:04,704 INFO] NMTModel(
(encoder): RNNEncoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(75060, 224, padding_idx=1)
)
)
)
(rnn): GRU(224, 128, dropout=0.2)
)
(decoder): InputFeedRNNDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(13146, 336, padding_idx=1)
)
)
)
(dropout): Dropout(p=0.2, inplace=False)
(rnn): StackedGRU(
(dropout): Dropout(p=0.2, inplace=False)
(layers): ModuleList(
(0): GRUCell(464, 128)
)
)
(attn): GlobalAttention(
(linear_out): Linear(in_features=256, out_features=128, bias=False)
)
)
(generator): Sequential(
(0): Linear(in_features=128, out_features=13146, bias=True)
(1): Cast()
(2): LogSoftmax()
)
)
[2019-10-11 13:47:04,704 INFO] encoder: 16949376
[2019-10-11 13:47:04,704 INFO] decoder: 6373754
[2019-10-11 13:47:04,704 INFO] * number of parameters: 23323130
[2019-10-11 13:47:04,706 INFO] Starting training on CPU, could be very slow
[2019-10-11 13:47:04,706 INFO] Start training loop and validate every 100 steps…
[2019-10-11 13:47:04,706 INFO] Loading dataset from data/data.train.0.pt
[2019-10-11 13:47:13,355 INFO] number of examples: 249382
Note: I don't have a GPU on my machine.
Also, I cannot choose encoder type brnn (it says it's not a valid arg). Why?
And I'm still confused about epochs.

-encoder_type brnn works fine on my end with your exact same config (without the pretrained embeddings though, I don't have any at hand). Can you try without -pre_word_vecs_enc "data/embeddings.enc.pt" -pre_word_vecs_dec "data/embeddings.dec.pt" to see if it could be some side effect (though I doubt it)?
And please provide a trace of any error you encounter.

Regarding epochs, please do a bit of research, it’s a fairly common topic: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/

Also, don’t expect to go very far without a GPU. Depending on the task, it could take weeks to properly train a simple model on CPU.

I know the difference between an epoch and a batch, but I mean epochs in OpenNMT, so I copied the log to show the details of my training.

You don't see much because you're training on CPU. You can set -report_every 1 to report at every step, and you'll see that each step takes very long, so you're not even reaching one epoch.

[2019-10-11 16:22:44,362 INFO] encoder: 16949376
[2019-10-11 16:22:44,362 INFO] decoder: 6373754
[2019-10-11 16:22:44,362 INFO] * number of parameters: 23323130
[2019-10-11 16:22:44,364 INFO] Starting training on CPU, could be very slow
[2019-10-11 16:22:44,364 INFO] Start training loop and validate every 100 steps…
[2019-10-11 16:22:44,364 INFO] Loading dataset from data/data.train.0.pt
[2019-10-11 16:22:55,757 INFO] number of examples: 249382
[2019-10-11 16:23:38,329 INFO] Step 1/ 200; acc: 0.01; ppl: 13891.76; xent: 9.54; lr: 0.00100; 3480/365 tok/s; 54 sec
[2019-10-11 16:23:43,293 INFO] Step 2/ 200; acc: 0.04; ppl: 12477.28; xent: 9.43; lr: 0.00100; 11449/3340 tok/s; 59 sec
[2019-10-11 16:23:48,037 INFO] Step 3/ 200; acc: 0.71; ppl: 11708.22; xent: 9.37; lr: 0.00100; 9197/4168 tok/s; 64 sec
[2019-10-11 16:24:14,775 INFO] Step 4/ 200; acc: 2.95; ppl: 10611.20; xent: 9.27; lr: 0.00100; 5276/768 tok/s; 90 sec
[2019-10-11 16:24:26,897 INFO] Step 5/ 200; acc: 7.49; ppl: 9105.57; xent: 9.12; lr: 0.00100; 8195/1631 tok/s; 103 sec
[2019-10-11 16:24:36,362 INFO] Step 6/ 200; acc: 8.50; ppl: 8072.00; xent: 9.00; lr: 0.00100; 7595/2141 tok/s; 112 sec
[2019-10-11 16:24:40,943 INFO] Step 7/ 200; acc: 7.45; ppl: 7430.63; xent: 8.91; lr: 0.00100; 6147/5808 tok/s; 117 sec
[2019-10-11 16:24:53,996 INFO] Step 8/ 200; acc: 12.88; ppl: 5642.40; xent: 8.64; lr: 0.00100; 7637/1228 tok/s; 130 sec
[2019-10-11 16:25:11,049 INFO] Step 9/ 200; acc: 10.33; ppl: 5029.12; xent: 8.52; lr: 0.00100; 6923/1214 tok/s; 147 sec
[2019-10-11 16:25:14,776 INFO] Step 10/ 200; acc: 9.49; ppl: 4259.83; xent: 8.36; lr: 0.00100; 8522/5369 tok/s; 150 sec
[2019-10-11 16:25:25,449 INFO] Step 11/ 200; acc: 9.63; ppl: 3417.45; xent: 8.14; lr: 0.00100; 9085/1754 tok/s; 161 sec
[2019-10-11 16:25:33,974 INFO] Step 12/ 200; acc: 7.92; ppl: 3064.55; xent: 8.03; lr: 0.00100; 8749/2556 tok/s; 170 sec
[2019-10-11 16:25:56,050 INFO] Step 13/ 200; acc: 7.61; ppl: 2604.44; xent: 7.86; lr: 0.00100; 6623/1013 tok/s; 192 sec
[2019-10-11 16:26:15,988 INFO] Step 14/ 200; acc: 6.68; ppl: 2393.13; xent: 7.78; lr: 0.00100; 6831/1231 tok/s; 212 sec
[2019-10-11 16:26:29,641 INFO] Step 15/ 200; acc: 8.59; ppl: 1822.84; xent: 7.51; lr: 0.00100; 7035/1361 tok/s; 225 sec
[2019-10-11 16:26:39,035 INFO] Step 16/ 200; acc: 7.08; ppl: 1705.28; xent: 7.44; lr: 0.00100; 7563/2346 tok/s; 235 sec
[2019-10-11 16:26:42,565 INFO] Step 17/ 200; acc: 8.27; ppl: 1388.46; xent: 7.24; lr: 0.00100; 10733/5412 tok/s; 238 sec
[2019-10-11 16:27:07,867 INFO] Step 18/ 200; acc: 7.29; ppl: 1324.30; xent: 7.19; lr: 0.00100; 6249/852 tok/s; 264 sec
[2019-10-11 16:27:19,690 INFO] Step 19/ 200; acc: 6.89; ppl: 1172.15; xent: 7.07; lr: 0.00100; 7106/1914 tok/s; 275 sec
[2019-10-11 16:27:27,367 INFO] Step 20/ 200; acc: 10.58; ppl: 944.37; xent: 6.85; lr: 0.00100; 9461/1906 tok/s; 283 sec
[2019-10-11 16:28:07,700 INFO] Step 21/ 200; acc: 7.41; ppl: 957.33; xent: 6.86; lr: 0.00100; 4801/518 tok/s; 323 sec
[2019-10-11 16:28:23,341 INFO] Step 22/ 200; acc: 8.67; ppl: 825.43; xent: 6.72; lr: 0.00100; 6758/1138 tok/s; 339 sec
[2019-10-11 16:28:45,963 INFO] Step 23/ 200; acc: 8.45; ppl: 785.08; xent: 6.67; lr: 0.00100; 6145/812 tok/s; 362 sec
[2019-10-11 16:29:04,498 INFO] Step 24/ 200; acc: 7.21; ppl: 755.49; xent: 6.63; lr: 0.00100; 6239/1156 tok/s; 380 sec
[2019-10-11 16:29:08,356 INFO] Step 25/ 200; acc: 7.12; ppl: 698.37; xent: 6.55; lr: 0.00100; 8496/5621 tok/s; 384 sec
[2019-10-11 16:29:22,618 INFO] Step 26/ 200; acc: 6.32; ppl: 683.40; xent: 6.53; lr: 0.00100; 7620/1738 tok/s; 398 sec
[2019-10-11 16:29:30,332 INFO] Step 27/ 200; acc: 6.72; ppl: 617.86; xent: 6.43; lr: 0.00100; 7697/3032 tok/s; 406 sec
[2019-10-11 16:29:36,673 INFO] Step 28/ 200; acc: 6.38; ppl: 632.67; xent: 6.45; lr: 0.00100; 6622/3973 tok/s; 412 sec
[2019-10-11 16:29:51,246 INFO] Step 29/ 200; acc: 9.87; ppl: 514.05; xent: 6.24; lr: 0.00100; 7011/1100 tok/s; 427 sec
[2019-10-11 16:30:17,634 INFO] Step 30/ 200; acc: 7.98; ppl: 572.56; xent: 6.35; lr: 0.00100; 5728/819 tok/s; 453 sec
[2019-10-11 16:30:27,849 INFO] Step 31/ 200; acc: 9.30; ppl: 517.87; xent: 6.25; lr: 0.00100; 8033/1798 tok/s; 463 sec
[2019-10-11 16:30:53,248 INFO] Step 32/ 200; acc: 7.34; ppl: 613.39; xent: 6.42; lr: 0.00100; 4850/1244 tok/s; 489 sec
[2019-10-11 16:31:08,722 INFO] Step 33/ 200; acc: 9.66; ppl: 510.08; xent: 6.23; lr: 0.00100; 7580/1364 tok/s; 504 sec
[2019-10-11 16:31:20,705 INFO] Step 34/ 200; acc: 9.46; ppl: 483.07; xent: 6.18; lr: 0.00100; 7714/1834 tok/s; 516 sec
[2019-10-11 16:31:33,054 INFO] Step 35/ 200; acc: 9.65; ppl: 469.14; xent: 6.15; lr: 0.00100; 7546/1501 tok/s; 529 sec
[2019-10-11 16:31:38,889 INFO] Step 36/ 200; acc: 8.89; ppl: 447.97; xent: 6.10; lr: 0.00100; 9004/3172 tok/s; 535 sec
[2019-10-11 16:31:57,536 INFO] Step 37/ 200; acc: 8.20; ppl: 555.79; xent: 6.32; lr: 0.00100; 5723/1259 tok/s; 553 sec
[2019-10-11 16:32:10,922 INFO] Step 38/ 200; acc: 9.08; ppl: 466.63; xent: 6.15; lr: 0.00100; 8721/1348 tok/s; 567 sec
[2019-10-11 16:32:15,849 INFO] Step 39/ 200; acc: 9.70; ppl: 410.20; xent: 6.02; lr: 0.00100; 11952/3246 tok/s; 571 sec
[2019-10-11 16:32:29,457 INFO] Step 40/ 200; acc: 9.18; ppl: 427.63; xent: 6.06; lr: 0.00100; 8175/1210 tok/s; 585 sec
[2019-10-11 16:32:42,901 INFO] Step 41/ 200; acc: 9.36; ppl: 385.79; xent: 5.96; lr: 0.00100; 5830/1297 tok/s; 599 sec
[2019-10-11 16:32:54,077 INFO] Step 42/ 200; acc: 8.50; ppl: 463.01; xent: 6.14; lr: 0.00100; 6368/2044 tok/s; 610 sec
[2019-10-11 16:33:01,033 INFO] Step 43/ 200; acc: 12.84; ppl: 366.67; xent: 5.90; lr: 0.00100; 7965/2825 tok/s; 617 sec
[2019-10-11 16:33:27,791 INFO] Step 44/ 200; acc: 9.94; ppl: 514.37; xent: 6.24; lr: 0.00100; 6637/820 tok/s; 643 sec
[2019-10-11 16:33:34,425 INFO] Step 45/ 200; acc: 12.14; ppl: 371.46; xent: 5.92; lr: 0.00100; 9264/2693 tok/s; 650 sec
[2019-10-11 16:33:41,934 INFO] Step 46/ 200; acc: 11.15; ppl: 446.31; xent: 6.10; lr: 0.00100; 9069/3259 tok/s; 658 sec
[2019-10-11 16:33:50,514 INFO] Step 47/ 200; acc: 11.52; ppl: 460.00; xent: 6.13; lr: 0.00100; 9107/2538 tok/s; 666 sec
[2019-10-11 16:34:10,462 INFO] Step 48/ 200; acc: 11.35; ppl: 487.17; xent: 6.19; lr: 0.00100; 6507/1165 tok/s; 686 sec
[2019-10-11 16:34:15,805 INFO] Step 49/ 200; acc: 13.22; ppl: 394.58; xent: 5.98; lr: 0.00100; 9871/3279 tok/s; 691 sec
[2019-10-11 16:34:28,581 INFO] Step 50/ 200; acc: 13.10; ppl: 430.09; xent: 6.06; lr: 0.00100; 6603/1529 tok/s; 704 sec
[2019-10-11 16:34:40,149 INFO] Step 51/ 200; acc: 12.44; ppl: 484.26; xent: 6.18; lr: 0.00100; 7422/1835 tok/s; 716 sec
[2019-10-11 16:34:48,497 INFO] Step 52/ 200; acc: 15.42; ppl: 332.32; xent: 5.81; lr: 0.00100; 8648/2299 tok/s; 724 sec
[2019-10-11 16:35:14,649 INFO] Step 53/ 200; acc: 13.30; ppl: 439.72; xent: 6.09; lr: 0.00100; 5108/863 tok/s; 750 sec
[2019-10-11 16:35:37,162 INFO] Step 54/ 200; acc: 12.33; ppl: 496.77; xent: 6.21; lr: 0.00100; 5679/1058 tok/s; 773 sec
[2019-10-11 16:36:06,979 INFO] Step 55/ 200; acc: 13.44; ppl: 462.05; xent: 6.14; lr: 0.00100; 5446/725 tok/s; 803 sec
[2019-10-11 16:36:34,863 INFO] Step 56/ 200; acc: 12.29; ppl: 468.08; xent: 6.15; lr: 0.00100; 5287/905 tok/s; 830 sec
[2019-10-11 16:37:00,683 INFO] Step 57/ 200; acc: 13.01; ppl: 455.93; xent: 6.12; lr: 0.00100; 5575/859 tok/s; 856 sec
[2019-10-11 16:37:09,931 INFO] Step 58/ 200; acc: 15.62; ppl: 371.55; xent: 5.92; lr: 0.00100; 9412/1855 tok/s; 866 sec
[2019-10-11 16:37:19,348 INFO] Step 59/ 200; acc: 13.54; ppl: 413.36; xent: 6.02; lr: 0.00100; 8972/2188 tok/s; 875 sec
[2019-10-11 16:37:26,756 INFO] Step 60/ 200; acc: 16.12; ppl: 339.92; xent: 5.83; lr: 0.00100; 10230/2185 tok/s; 882 sec
[2019-10-11 16:37:45,842 INFO] Step 61/ 200; acc: 14.57; ppl: 366.34; xent: 5.90; lr: 0.00100; 6071/1101 tok/s; 901 sec
[2019-10-11 16:38:00,604 INFO] Step 62/ 200; acc: 15.10; ppl: 374.24; xent: 5.92; lr: 0.00100; 5786/1336 tok/s; 916 sec
[2019-10-11 16:38:15,715 INFO] Step 63/ 200; acc: 12.71; ppl: 464.75; xent: 6.14; lr: 0.00100; 6701/1660 tok/s; 931 sec
[2019-10-11 16:38:25,081 INFO] Step 64/ 200; acc: 14.37; ppl: 396.16; xent: 5.98; lr: 0.00100; 8693/2275 tok/s; 941 sec
[2019-10-11 16:38:35,780 INFO] Step 65/ 200; acc: 15.54; ppl: 399.03; xent: 5.99; lr: 0.00100; 6834/1710 tok/s; 951 sec
[2019-10-11 16:38:41,652 INFO] Step 66/ 200; acc: 14.12; ppl: 402.70; xent: 6.00; lr: 0.00100; 6197/3617 tok/s; 957 sec
[2019-10-11 16:38:47,819 INFO] Step 67/ 200; acc: 14.10; ppl: 477.39; xent: 6.17; lr: 0.00100; 8915/3460 tok/s; 963 sec
[2019-10-11 16:39:04,808 INFO] Step 68/ 200; acc: 14.87; ppl: 395.49; xent: 5.98; lr: 0.00100; 6223/1236 tok/s; 980 sec
[2019-10-11 16:39:09,760 INFO] Step 69/ 200; acc: 14.16; ppl: 373.47; xent: 5.92; lr: 0.00100; 9045/4240 tok/s; 985 sec
[2019-10-11 16:39:17,861 INFO] Step 70/ 200; acc: 16.41; ppl: 336.33; xent: 5.82; lr: 0.00100; 8258/2168 tok/s; 993 sec
[2019-10-11 16:39:27,800 INFO] Step 71/ 200; acc: 14.46; ppl: 364.85; xent: 5.90; lr: 0.00100; 8244/2232 tok/s; 1003 sec
[2019-10-11 16:39:53,299 INFO] Step 72/ 200; acc: 16.25; ppl: 371.63; xent: 5.92; lr: 0.00100; 5806/716 tok/s; 1029 sec
[2019-10-11 16:40:09,126 INFO] Step 73/ 200; acc: 17.20; ppl: 352.40; xent: 5.86; lr: 0.00100; 6555/1049 tok/s; 1045 sec
[2019-10-11 16:40:15,686 INFO] Step 74/ 200; acc: 17.36; ppl: 329.18; xent: 5.80; lr: 0.00100; 8807/2502 tok/s; 1051 sec
[2019-10-11 16:40:18,612 INFO] Step 75/ 200; acc: 19.18; ppl: 293.90; xent: 5.68; lr: 0.00100; 14175/4969 tok/s; 1054 sec
[2019-10-11 16:40:49,384 INFO] Step 76/ 200; acc: 14.81; ppl: 384.68; xent: 5.95; lr: 0.00100; 5633/690 tok/s; 1085 sec
[2019-10-11 16:41:05,190 INFO] Step 77/ 200; acc: 14.65; ppl: 364.90; xent: 5.90; lr: 0.00100; 7339/1443 tok/s; 1101 sec
[2019-10-11 16:41:13,057 INFO] Step 78/ 200; acc: 16.30; ppl: 335.06; xent: 5.81; lr: 0.00100; 7530/2473 tok/s; 1109 sec
[2019-10-11 16:41:27,562 INFO] Step 79/ 200; acc: 14.90; ppl: 388.64; xent: 5.96; lr: 0.00100; 6564/1537 tok/s; 1123 sec
[2019-10-11 16:41:34,006 INFO] Step 80/ 200; acc: 18.90; ppl: 291.37; xent: 5.67; lr: 0.00100; 9051/2508 tok/s; 1130 sec
[2019-10-11 16:41:39,338 INFO] Step 81/ 200; acc: 13.18; ppl: 419.06; xent: 6.04; lr: 0.00100; 3675/5150 tok/s; 1135 sec
[2019-10-11 16:41:44,350 INFO] Step 82/ 200; acc: 17.13; ppl: 360.70; xent: 5.89; lr: 0.00100; 9399/3582 tok/s; 1140 sec
[2019-10-11 16:41:59,091 INFO] Step 83/ 200; acc: 16.54; ppl: 326.70; xent: 5.79; lr: 0.00100; 7720/1245 tok/s; 1155 sec
[2019-10-11 16:42:09,634 INFO] Step 84/ 200; acc: 13.16; ppl: 410.09; xent: 6.02; lr: 0.00100; 6532/2644 tok/s; 1165 sec
[2019-10-11 16:42:17,968 INFO] Step 85/ 200; acc: 15.53; ppl: 350.64; xent: 5.86; lr: 0.00100; 9891/2710 tok/s; 1174 sec
[2019-10-11 16:42:35,737 INFO] Step 86/ 200; acc: 16.53; ppl: 348.54; xent: 5.85; lr: 0.00100; 6018/1174 tok/s; 1191 sec
[2019-10-11 16:42:43,547 INFO] Step 87/ 200; acc: 18.52; ppl: 294.80; xent: 5.69; lr: 0.00100; 8521/2171 tok/s; 1199 sec
[2019-10-11 16:43:08,478 INFO] Step 88/ 200; acc: 18.52; ppl: 309.26; xent: 5.73; lr: 0.00100; 5429/661 tok/s; 1224 sec
[2019-10-11 16:43:14,662 INFO] Step 89/ 200; acc: 18.38; ppl: 307.15; xent: 5.73; lr: 0.00100; 10979/2668 tok/s; 1230 sec
[2019-10-11 16:43:39,941 INFO] Step 90/ 200; acc: 18.10; ppl: 313.80; xent: 5.75; lr: 0.00100; 5832/699 tok/s; 1256 sec
[2019-10-11 16:43:58,208 INFO] Step 91/ 200; acc: 12.23; ppl: 409.02; xent: 6.01; lr: 0.00100; 5802/1721 tok/s; 1274 sec
[2019-10-11 16:44:03,964 INFO] Step 92/ 200; acc: 17.98; ppl: 284.09; xent: 5.65; lr: 0.00100; 7884/3346 tok/s; 1280 sec
[2019-10-11 16:44:08,583 INFO] Step 93/ 200; acc: 19.20; ppl: 263.68; xent: 5.57; lr: 0.00100; 10421/3689 tok/s; 1284 sec
[2019-10-11 16:44:15,584 INFO] Step 94/ 200; acc: 15.31; ppl: 320.61; xent: 5.77; lr: 0.00100; 5892/3554 tok/s; 1291 sec
[2019-10-11 16:44:29,894 INFO] Step 95/ 200; acc: 21.59; ppl: 238.02; xent: 5.47; lr: 0.00100; 7147/1385 tok/s; 1306 sec
[2019-10-11 16:44:33,584 INFO] Step 96/ 200; acc: 18.00; ppl: 268.80; xent: 5.59; lr: 0.00100; 9185/4913 tok/s; 1309 sec
[2019-10-11 16:44:45,816 INFO] Step 97/ 200; acc: 17.86; ppl: 283.17; xent: 5.65; lr: 0.00100; 6434/1542 tok/s; 1321 sec
[2019-10-11 16:45:10,765 INFO] Step 98/ 200; acc: 15.60; ppl: 326.92; xent: 5.79; lr: 0.00100; 5746/902 tok/s; 1346 sec
[2019-10-11 16:45:12,755 INFO] Step 99/ 200; acc: 27.99; ppl: 148.84; xent: 5.00; lr: 0.00100; 11838/6028 tok/s; 1348 sec
[2019-10-11 16:45:19,407 INFO] Step 100/ 200; acc: 21.86; ppl: 226.33; xent: 5.42; lr: 0.00100; 5828/2241 tok/s; 1355 sec
[2019-10-11 16:45:19,408 INFO] Loading dataset from data/data.valid.0.pt
[2019-10-11 16:45:22,587 INFO] number of examples: 31174
[2019-10-11 16:46:38,339 INFO] Validation perplexity: 231.081
[2019-10-11 16:46:38,339 INFO] Validation accuracy: 18.4887
[2019-10-11 16:46:38,340 INFO] Model is improving ppl: inf --> 231.081.
[2019-10-11 16:46:38,340 INFO] Model is improving acc: -inf --> 18.4887.
[2019-10-11 16:46:40,417 INFO] Saving checkpoint data/model_step_100.pt
[2019-10-11 16:46:46,657 INFO] Step 101/ 200; acc: 21.92; ppl: 230.60; xent: 5.44; lr: 0.00100; 457/194 tok/s; 1442 sec
[2019-10-11 16:47:08,473 INFO] Step 102/ 200; acc: 15.10; ppl: 348.64; xent: 5.85; lr: 0.00100; 5560/1172 tok/s; 1464 sec
[2019-10-11 16:47:15,029 INFO] Step 103/ 200; acc: 19.45; ppl: 248.97; xent: 5.52; lr: 0.00100; 9997/2955 tok/s; 1471 sec
[2019-10-11 16:47:19,967 INFO] Step 104/ 200; acc: 20.27; ppl: 195.92; xent: 5.28; lr: 0.00100; 9747/3222 tok/s; 1476 sec
[2019-10-11 16:47:24,972 INFO] Step 105/ 200; acc: 23.53; ppl: 165.10; xent: 5.11; lr: 0.00100; 12174/3524 tok/s; 1481 sec
[2019-10-11 16:47:33,218 INFO] Step 106/ 200; acc: 14.89; ppl: 334.69; xent: 5.81; lr: 0.00100; 5766/3325 tok/s; 1489 sec
[2019-10-11 16:47:36,376 INFO] Step 107/ 200; acc: 21.46; ppl: 206.75; xent: 5.33; lr: 0.00100; 11188/5278 tok/s; 1492 sec
[2019-10-11 16:48:06,836 INFO] Step 108/ 200; acc: 17.18; ppl: 317.20; xent: 5.76; lr: 0.00100; 5638/666 tok/s; 1522 sec
[2019-10-11 16:48:11,236 INFO] Step 109/ 200; acc: 20.17; ppl: 232.58; xent: 5.45; lr: 0.00100; 9919/4658 tok/s; 1527 sec
[2019-10-11 16:48:15,289 INFO] Step 110/ 200; acc: 22.90; ppl: 202.59; xent: 5.31; lr: 0.00100; 11499/3912 tok/s; 1531 sec
[2019-10-11 16:48:30,574 INFO] Step 111/ 200; acc: 17.64; ppl: 290.27; xent: 5.67; lr: 0.00100; 7733/1347 tok/s; 1546 sec
[2019-10-11 16:48:40,995 INFO] Step 112/ 200; acc: 15.10; ppl: 356.18; xent: 5.88; lr: 0.00100; 5842/2467 tok/s; 1557 sec
[2019-10-11 16:48:48,724 INFO] Step 113/ 200; acc: 20.20; ppl: 263.73; xent: 5.57; lr: 0.00100; 10800/2124 tok/s; 1564 sec
[2019-10-11 16:49:02,084 INFO] Step 114/ 200; acc: 16.72; ppl: 294.32; xent: 5.68; lr: 0.00100; 7432/1615 tok/s; 1578 sec
[2019-10-11 16:49:06,196 INFO] Step 115/ 200; acc: 17.27; ppl: 268.62; xent: 5.59; lr: 0.00100; 6850/5582 tok/s; 1582 sec
[2019-10-11 16:49:11,112 INFO] Step 116/ 200; acc: 19.72; ppl: 246.25; xent: 5.51; lr: 0.00100; 12084/3804 tok/s; 1587 sec
[2019-10-11 16:49:19,053 INFO] Step 117/ 200; acc: 17.52; ppl: 265.48; xent: 5.58; lr: 0.00100; 6239/3273 tok/s; 1595 sec
[2019-10-11 16:49:28,149 INFO] Step 118/ 200; acc: 17.88; ppl: 274.51; xent: 5.62; lr: 0.00100; 9231/2409 tok/s; 1604 sec
[2019-10-11 16:49:39,185 INFO] Step 119/ 200; acc: 17.15; ppl: 282.54; xent: 5.64; lr: 0.00100; 7914/2156 tok/s; 1615 sec
[2019-10-11 16:50:01,355 INFO] Step 120/ 200; acc: 17.44; ppl: 295.23; xent: 5.69; lr: 0.00100; 6817/1000 tok/s; 1637 sec
[2019-10-11 16:50:08,809 INFO] Step 121/ 200; acc: 17.57; ppl: 280.75; xent: 5.64; lr: 0.00100; 5908/3006 tok/s; 1644 sec
[2019-10-11 16:50:13,140 INFO] Step 122/ 200; acc: 22.49; ppl: 196.74; xent: 5.28; lr: 0.00100; 11585/3833 tok/s; 1649 sec
[2019-10-11 16:50:22,885 INFO] Step 123/ 200; acc: 17.92; ppl: 276.20; xent: 5.62; lr: 0.00100; 9195/2308 tok/s; 1659 sec
[2019-10-11 16:50:29,889 INFO] Step 124/ 200; acc: 21.69; ppl: 187.56; xent: 5.23; lr: 0.00100; 8480/3502 tok/s; 1666 sec
[2019-10-11 16:50:37,419 INFO] Step 125/ 200; acc: 17.79; ppl: 260.48; xent: 5.56; lr: 0.00100; 9326/3028 tok/s; 1673 sec
[2019-10-11 16:50:41,497 INFO] Step 126/ 200; acc: 23.76; ppl: 174.97; xent: 5.16; lr: 0.00100; 10171/4915 tok/s; 1677 sec
[2019-10-11 16:50:48,158 INFO] Step 127/ 200; acc: 21.05; ppl: 227.10; xent: 5.43; lr: 0.00100; 9916/2501 tok/s; 1684 sec
[2019-10-11 16:50:53,950 INFO] Step 128/ 200; acc: 19.00; ppl: 234.93; xent: 5.46; lr: 0.00100; 6302/4029 tok/s; 1690 sec
[2019-10-11 16:50:58,353 INFO] Step 129/ 200; acc: 21.28; ppl: 206.30; xent: 5.33; lr: 0.00100; 10814/4053 tok/s; 1694 sec
[2019-10-11 16:51:11,152 INFO] Step 130/ 200; acc: 19.02; ppl: 256.83; xent: 5.55; lr: 0.00100; 7614/1556 tok/s; 1707 sec
[2019-10-11 16:51:22,289 INFO] Step 131/ 200; acc: 18.22; ppl: 255.58; xent: 5.54; lr: 0.00100; 7105/1904 tok/s; 1718 sec
[2019-10-11 16:51:27,368 INFO] Step 132/ 200; acc: 17.64; ppl: 257.83; xent: 5.55; lr: 0.00100; 7663/5166 tok/s; 1723 sec
[2019-10-11 16:51:34,863 INFO] Step 133/ 200; acc: 17.59; ppl: 258.07; xent: 5.55; lr: 0.00100; 8539/3087 tok/s; 1730 sec
[2019-10-11 16:51:44,652 INFO] Step 134/ 200; acc: 20.96; ppl: 195.46; xent: 5.28; lr: 0.00100; 9363/2193 tok/s; 1740 sec
[2019-10-11 16:51:53,708 INFO] Step 135/ 200; acc: 26.98; ppl: 99.48; xent: 4.60; lr: 0.00100; 7882/2694 tok/s; 1749 sec
[2019-10-11 16:51:58,829 INFO] Step 136/ 200; acc: 25.29; ppl: 177.23; xent: 5.18; lr: 0.00100; 10100/2830 tok/s; 1754 sec
[2019-10-11 16:52:13,188 INFO] Step 137/ 200; acc: 22.14; ppl: 214.41; xent: 5.37; lr: 0.00100; 6430/1199 tok/s; 1769 sec
[2019-10-11 16:52:15,516 INFO] Step 138/ 200; acc: 26.50; ppl: 146.44; xent: 4.99; lr: 0.00100; 6600/6282 tok/s; 1771 sec
[2019-10-11 16:52:33,470 INFO] Step 139/ 200; acc: 16.59; ppl: 277.32; xent: 5.63; lr: 0.00100; 6044/1460 tok/s; 1789 sec
[2019-10-11 16:52:41,500 INFO] Step 140/ 200; acc: 22.92; ppl: 195.03; xent: 5.27; lr: 0.00100; 8211/1986 tok/s; 1797 sec
[2019-10-11 16:52:45,350 INFO] Step 141/ 200; acc: 19.96; ppl: 228.09; xent: 5.43; lr: 0.00100; 8644/5489 tok/s; 1801 sec
[2019-10-11 16:53:10,785 INFO] Step 142/ 200; acc: 18.56; ppl: 288.84; xent: 5.67; lr: 0.00100; 6634/894 tok/s; 1826 sec
[2019-10-11 16:53:26,513 INFO] Step 143/ 200; acc: 17.77; ppl: 275.82; xent: 5.62; lr: 0.00100; 7041/1634 tok/s; 1842 sec
[2019-10-11 16:53:37,211 INFO] Step 144/ 200; acc: 20.82; ppl: 224.97; xent: 5.42; lr: 0.00100; 8090/1875 tok/s; 1853 sec
[2019-10-11 16:53:43,649 INFO] Step 145/ 200; acc: 23.48; ppl: 183.08; xent: 5.21; lr: 0.00100; 11532/2412 tok/s; 1859 sec
[2019-10-11 16:53:50,462 INFO] Step 146/ 200; acc: 22.38; ppl: 200.81; xent: 5.30; lr: 0.00100; 11649/2197 tok/s; 1866 sec
[2019-10-11 16:53:59,694 INFO] Step 147/ 200; acc: 23.03; ppl: 193.15; xent: 5.26; lr: 0.00100; 7872/1868 tok/s; 1875 sec
[2019-10-11 16:54:04,935 INFO] Step 148/ 200; acc: 19.63; ppl: 233.14; xent: 5.45; lr: 0.00100; 9024/2527 tok/s; 1881 sec
[2019-10-11 16:54:10,222 INFO] Step 149/ 200; acc: 17.07; ppl: 278.65; xent: 5.63; lr: 0.00100; 7194/4620 tok/s; 1886 sec
[2019-10-11 16:54:26,840 INFO] Step 150/ 200; acc: 19.32; ppl: 235.92; xent: 5.46; lr: 0.00100; 7171/1253 tok/s; 1902 sec
[2019-10-11 16:54:45,040 INFO] Step 151/ 200; acc: 20.71; ppl: 231.13; xent: 5.44; lr: 0.00100; 5806/971 tok/s; 1921 sec
[2019-10-11 16:54:51,175 INFO] Step 152/ 200; acc: 17.30; ppl: 254.47; xent: 5.54; lr: 0.00100; 7042/4433 tok/s; 1927 sec
[2019-10-11 16:55:16,557 INFO] Step 153/ 200; acc: 17.83; ppl: 266.63; xent: 5.59; lr: 0.00100; 5869/995 tok/s; 1952 sec
[2019-10-11 16:55:32,113 INFO] Step 154/ 200; acc: 19.89; ppl: 221.36; xent: 5.40; lr: 0.00100; 6574/1324 tok/s; 1968 sec
[2019-10-11 16:55:53,389 INFO] Step 155/ 200; acc: 22.75; ppl: 190.89; xent: 5.25; lr: 0.00100; 5032/779 tok/s; 1989 sec
[2019-10-11 16:55:56,537 INFO] Step 156/ 200; acc: 23.11; ppl: 187.39; xent: 5.23; lr: 0.00100; 7971/5535 tok/s; 1992 sec
[2019-10-11 16:56:03,227 INFO] Step 157/ 200; acc: 21.68; ppl: 192.89; xent: 5.26; lr: 0.00100; 8579/4053 tok/s; 1999 sec
[2019-10-11 16:56:24,337 INFO] Step 158/ 200; acc: 18.69; ppl: 241.03; xent: 5.48; lr: 0.00100; 6108/1140 tok/s; 2020 sec
[2019-10-11 16:56:43,108 INFO] Step 159/ 200; acc: 22.41; ppl: 198.86; xent: 5.29; lr: 0.00100; 6299/910 tok/s; 2039 sec
[2019-10-11 16:56:57,835 INFO] Step 160/ 200; acc: 19.75; ppl: 243.38; xent: 5.49; lr: 0.00100; 8041/1414 tok/s; 2053 sec
[2019-10-11 16:57:15,650 INFO] Step 161/ 200; acc: 20.75; ppl: 205.25; xent: 5.32; lr: 0.00100; 7385/1115 tok/s; 2071 sec
[2019-10-11 16:57:27,969 INFO] Step 162/ 200; acc: 18.32; ppl: 242.83; xent: 5.49; lr: 0.00100; 7671/1941 tok/s; 2084 sec
[2019-10-11 16:57:28,063 INFO] Loading dataset from data/data.train.0.pt
[2019-10-11 16:57:43,420 INFO] number of examples: 249382
[2019-10-11 16:58:27,596 INFO] Step 163/ 200; acc: 18.03; ppl: 266.50; xent: 5.59; lr: 0.00100; 3631/388 tok/s; 2143 sec
[2019-10-11 16:58:50,417 INFO] Step 164/ 200; acc: 21.63; ppl: 206.62; xent: 5.33; lr: 0.00100; 5466/828 tok/s; 2166 sec
[2019-10-11 16:58:54,707 INFO] Step 165/ 200; acc: 23.49; ppl: 162.62; xent: 5.09; lr: 0.00100; 10147/3571 tok/s; 2170 sec
[2019-10-11 16:59:01,267 INFO] Step 166/ 200; acc: 18.60; ppl: 239.71; xent: 5.48; lr: 0.00100; 8646/3449 tok/s; 2177 sec
[2019-10-11 16:59:32,055 INFO] Step 167/ 200; acc: 20.54; ppl: 226.46; xent: 5.42; lr: 0.00100; 5487/688 tok/s; 2208 sec
[2019-10-11 16:59:38,953 INFO] Step 168/ 200; acc: 24.15; ppl: 160.65; xent: 5.08; lr: 0.00100; 8761/2444 tok/s; 2215 sec
[2019-10-11 16:59:50,157 INFO] Step 169/ 200; acc: 22.31; ppl: 186.92; xent: 5.23; lr: 0.00100; 6508/2524 tok/s; 2226 sec
[2019-10-11 16:59:53,164 INFO] Step 170/ 200; acc: 24.18; ppl: 122.82; xent: 4.81; lr: 0.00100; 9709/5686 tok/s; 2229 sec
[2019-10-11 17:00:07,017 INFO] Step 171/ 200; acc: 22.80; ppl: 186.99; xent: 5.23; lr: 0.00100; 7565/1233 tok/s; 2243 sec
[2019-10-11 17:00:25,137 INFO] Step 172/ 200; acc: 25.48; ppl: 128.89; xent: 4.86; lr: 0.00100; 6148/1325 tok/s; 2261 sec
[2019-10-11 17:00:32,435 INFO] Step 173/ 200; acc: 21.60; ppl: 175.86; xent: 5.17; lr: 0.00100; 8236/2715 tok/s; 2268 sec
[2019-10-11 17:00:42,283 INFO] Step 174/ 200; acc: 21.82; ppl: 201.46; xent: 5.31; lr: 0.00100; 8926/1768 tok/s; 2278 sec
[2019-10-11 17:00:57,941 INFO] Step 175/ 200; acc: 20.62; ppl: 199.78; xent: 5.30; lr: 0.00100; 7222/1384 tok/s; 2294 sec
[2019-10-11 17:01:14,750 INFO] Step 176/ 200; acc: 31.00; ppl: 106.50; xent: 4.67; lr: 0.00100; 7039/1436 tok/s; 2310 sec
[2019-10-11 17:01:34,296 INFO] Step 177/ 200; acc: 19.72; ppl: 232.48; xent: 5.45; lr: 0.00100; 6693/1099 tok/s; 2330 sec
[2019-10-11 17:01:47,230 INFO] Step 178/ 200; acc: 19.67; ppl: 223.32; xent: 5.41; lr: 0.00100; 7036/1797 tok/s; 2343 sec
[2019-10-11 17:01:54,142 INFO] Step 179/ 200; acc: 24.74; ppl: 162.58; xent: 5.09; lr: 0.00100; 8872/2347 tok/s; 2350 sec
[2019-10-11 17:02:05,372 INFO] Step 180/ 200; acc: 22.28; ppl: 155.89; xent: 5.05; lr: 0.00100; 6718/2012 tok/s; 2361 sec
[2019-10-11 17:02:30,955 INFO] Step 181/ 200; acc: 20.08; ppl: 215.96; xent: 5.38; lr: 0.00100; 6195/858 tok/s; 2387 sec
[2019-10-11 17:02:37,266 INFO] Step 182/ 200; acc: 22.89; ppl: 179.88; xent: 5.19; lr: 0.00100; 9170/3049 tok/s; 2393 sec
[2019-10-11 17:02:44,888 INFO] Step 183/ 200; acc: 22.50; ppl: 176.78; xent: 5.17; lr: 0.00100; 7297/2091 tok/s; 2401 sec
[2019-10-11 17:03:23,846 INFO] Step 184/ 200; acc: 21.19; ppl: 211.07; xent: 5.35; lr: 0.00100; 5039/469 tok/s; 2439 sec
[2019-10-11 17:03:45,086 INFO] Step 185/ 200; acc: 20.07; ppl: 212.28; xent: 5.36; lr: 0.00100; 6688/940 tok/s; 2461 sec
[2019-10-11 17:04:02,288 INFO] Step 186/ 200; acc: 23.09; ppl: 152.95; xent: 5.03; lr: 0.00100; 5879/960 tok/s; 2478 sec
[2019-10-11 17:04:21,065 INFO] Step 187/ 200; acc: 20.64; ppl: 199.71; xent: 5.30; lr: 0.00100; 5858/1172 tok/s; 2497 sec
[2019-10-11 17:04:31,583 INFO] Step 188/ 200; acc: 19.40; ppl: 232.83; xent: 5.45; lr: 0.00100; 7216/2449 tok/s; 2507 sec
[2019-10-11 17:04:41,104 INFO] Step 189/ 200; acc: 21.37; ppl: 192.68; xent: 5.26; lr: 0.00100; 7905/2122 tok/s; 2517 sec
[2019-10-11 17:04:48,819 INFO] Step 190/ 200; acc: 18.10; ppl: 241.92; xent: 5.49; lr: 0.00100; 7383/3925 tok/s; 2524 sec
[2019-10-11 17:04:53,985 INFO] Step 191/ 200; acc: 25.97; ppl: 125.78; xent: 4.83; lr: 0.00100; 9393/3590 tok/s; 2530 sec
[2019-10-11 17:05:09,125 INFO] Step 192/ 200; acc: 21.30; ppl: 191.24; xent: 5.25; lr: 0.00100; 6072/1304 tok/s; 2545 sec
[2019-10-11 17:05:38,464 INFO] Step 193/ 200; acc: 20.21; ppl: 219.72; xent: 5.39; lr: 0.00100; 5815/609 tok/s; 2574 sec
[2019-10-11 17:05:47,114 INFO] Step 194/ 200; acc: 20.07; ppl: 206.41; xent: 5.33; lr: 0.00100; 7888/3029 tok/s; 2583 sec
[2019-10-11 17:06:16,811 INFO] Step 195/ 200; acc: 18.46; ppl: 250.30; xent: 5.52; lr: 0.00100; 5212/887 tok/s; 2612 sec
[2019-10-11 17:06:30,034 INFO] Step 196/ 200; acc: 20.81; ppl: 207.11; xent: 5.33; lr: 0.00100; 8263/1638 tok/s; 2626 sec
[2019-10-11 17:06:38,328 INFO] Step 197/ 200; acc: 22.72; ppl: 160.86; xent: 5.08; lr: 0.00100; 7628/2504 tok/s; 2634 sec
[2019-10-11 17:06:50,604 INFO] Step 198/ 200; acc: 24.12; ppl: 165.11; xent: 5.11; lr: 0.00100; 7465/1330 tok/s; 2646 sec
[2019-10-11 17:07:10,297 INFO] Step 199/ 200; acc: 20.86; ppl: 198.47; xent: 5.29; lr: 0.00100; 6081/1062 tok/s; 2666 sec
[2019-10-11 17:07:21,297 INFO] Step 200/ 200; acc: 20.06; ppl: 221.32; xent: 5.40; lr: 0.00100; 7453/2124 tok/s; 2677 sec
[2019-10-11 17:07:21,300 INFO] Loading dataset from data/data.valid.0.pt
[2019-10-11 17:07:21,955 INFO] number of examples: 31174
[2019-10-11 17:08:36,332 INFO] Validation perplexity: 144.461
[2019-10-11 17:08:36,333 INFO] Validation accuracy: 22.8532
[2019-10-11 17:08:36,333 INFO] Model is improving ppl: 231.081 --> 144.461.
[2019-10-11 17:08:36,333 INFO] Model is improving acc: 18.4887 --> 22.8532.
[2019-10-11 17:08:38,631 INFO] Saving checkpoint data/model_step_200.pt
This trained for 200 steps, and once finished, the training stopped.
I.e., I trained my model for only about one epoch! (249382 / (512 * 3) ≈ 162 steps per epoch, which matches the dataset reload after step 162 in the log above.)
Edit: after translation I calculated the BLEU score, and the result is:
BLEU = 0.86, 16.1/3.3/1.1/0.5 (BP=0.362, ratio=0.496, hyp_len=190872, ref_len=384994)
How can it be 0.86?

This is not magic. You need processing power to optimize your model.
You can try letting it run a few thousand steps; it should get better (or at least less bad) at inference.
Depending on the (true) batch size, model architecture, and size, it can take tens or hundreds of thousands of steps to train a proper model. With so little data it should converge quite quickly, but you still need to be patient, even more so since you're on CPU.
(Rent a GPU instance on AWS or elsewhere if you want quicker results.)
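
As for the 0.86: BLEU is reported as a percentage, and with hypotheses only half as long as the references (ratio=0.496) the brevity penalty crushes the score. You can roughly reconstruct it from the reported components with the standard BLEU formula (a sanity check, nothing OpenNMT-specific):

import math

# Reported by the scorer: 4 n-gram precisions (in %), hyp/ref lengths.
precisions = [16.1, 3.3, 1.1, 0.5]
hyp_len, ref_len = 190872, 384994

bp = math.exp(1 - ref_len / hyp_len)  # brevity penalty (hyp shorter than ref)
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(round(bp, 3), round(bp * geo_mean, 2))  # 0.362, ~0.84 (reported: 0.86;
                                              # the gap is rounding in the precisions)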

Thanks, sir, for helping me.
I have rented a Google Cloud machine with a single NVIDIA Tesla K80 GPU, and my training parameters are:
onmt_train -batch_size 512 \
-world_size 1 \
-gpu_rank 0 \
-layers 1 \
-rnn_size 128 \
-data data/data \
-pre_word_vecs_enc "data/embeddings.enc.pt" \
-pre_word_vecs_dec "data/embeddings.dec.pt" \
-src_word_vec_size 224 \
-tgt_word_vec_size 336 \
-fix_word_vecs_enc \
-fix_word_vecs_dec \
-save_model data/model \
-save_checkpoint_steps 100 \
-train_steps 1000 \
-model_type text \
-encoder_type rnn \
-decoder_type rnn \
-rnn_type GRU \
-global_attention dot \
-global_attention_function softmax \
-early_stopping 10 \
-optim adam \
-learning_rate 0.001 \
-valid_steps 100 \
-dropout .2 \
-attention_dropout .3 \
-tensorboard

  1. I want to train for about 20 epochs. How can I do that? (The training dataset is about 190k samples, and test & valid are 25k.) I set -train_steps to 1000 just for the experiment.
  2. tensorboard does not work. :(
  3. In your opinion, what do you think of my model parameters?
    Thanks again, sir
  1. Already explained this. You can estimate the number of steps in an epoch based on your batch size and the size of your data. (Be careful if your batch size is in tokens.)
  2. Please post an issue with the error trace on GitHub.
  3. Depends on your task, data, etc. Have a look at some papers.

Thanks, I was able to calculate the number of steps per epoch for my model.
I have 190,000 samples in the training dataset, batch size 256, and -accum_count 3,
so 190000 / 256 ≈ 742.2 batches, and 742.2 / 3 ≈ 247.4 steps for one epoch.
If I want to train my model for 50 epochs, that's 247.4 * 50 ≈ 12370 train_steps.
I also noticed in the training log that it loads the training data every ~247 steps.
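
The same arithmetic as a small helper, for reference (a sketch; it assumes sentence-based batches on a single GPU):

# Convert a target number of epochs into a -train_steps value
# (assumes -batch_size counts sentences and -world_size is 1).
def train_steps_for(dataset_size, batch_size, accum_count, epochs):
    steps_per_epoch = dataset_size / (batch_size * accum_count)
    return round(steps_per_epoch * epochs)

print(train_steps_for(190000, 256, 3, 1))   # ~247 (one epoch)
print(train_steps_for(190000, 256, 3, 50))  # ~12370 (50 epochs)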

Hello,
I'm training with OpenNMT-tf. Would my calculations be correct for an epoch, assuming I have the following parameters?

Training data size = 36000000 sentencepiece tokens
batch_size: 3072
batch_type: tokens
effective_batch_size: 25000
Number of GPUs = 1

Number of steps for a single epoch = Training data size / (effective_batch_size × Number of GPUs)
= 36000000 / (25000 × 1) = 1440 steps/epoch
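
If I understand effective_batch_size correctly (i.e. it is the total number of tokens consumed per optimizer step, with the GPU count already folded in), the estimate is simply total tokens divided by the effective batch size:

# Token-based steps-per-epoch estimate for OpenNMT-tf
# (assumes effective_batch_size = tokens consumed per training step).
train_tokens = 36_000_000
effective_batch_size = 25_000  # tokens

print(train_tokens / effective_batch_size)  # 1440.0 steps per epoch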

If you really need to know the number of epochs, run the training epoch by epoch to get an exact value:

https://opennmt.net/OpenNMT-tf/faq.html#how-to-count-the-number-of-epochs
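
For reference, the FAQ's approach (as I read it) is to make each training run a single pass over the data and relaunch once per epoch; something like this in the YAML configuration (a sketch, check the linked FAQ for the exact recipe):

# config.yml (excerpt) — stop after one pass over the training data
train:
  single_pass: true

Each relaunch of onmt-main ... train then resumes from the latest checkpoint in the model directory and stops after exactly one more epoch.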


I see what you're saying, thanks!

How can I save the checkpoint at the end of a single pass (epoch)?

A checkpoint is always saved at the end of the training.

I got this error:
ValueError: single_pass parameter is not compatible with weighted datasets

If you are using weighted datasets, the concept of epoch no longer exists because some datasets may be more sampled than others.

If you are not using weighted datasets, please post your configuration. This may be a bug.