Question about the performance difference between one GPU and two GPUs

Hi everyone, I want to use two GPUs to speed up training.

Here is the one-GPU performance:

420 [2019-05-08 23:37:08,581 INFO] Loading dataset from data/train.0.pt, number of examples: 999323
421 [2019-05-05 01:12:36,729 INFO] Step 50/200000; acc:   5.07; ppl: 10921.68; xent: 9.30; lr: 0.00001; 3786/4753 tok/s;     77 sec
422 [2019-05-05 01:13:23,140 INFO] Step 100/200000; acc:   9.77; ppl: 7141.37; xent: 8.87; lr: 0.00001; 6403/8054 tok/s;    123 sec
423 [2019-05-05 01:14:09,618 INFO] Step 150/200000; acc:   9.34; ppl: 4708.65; xent: 8.46; lr: 0.00002; 6201/7626 tok/s;    170 sec
424 [2019-05-05 01:14:56,062 INFO] Step 200/200000; acc:  10.11; ppl: 2552.89; xent: 7.84; lr: 0.00002; 6387/7871 tok/s;    216 sec
425 [2019-05-05 01:15:42,396 INFO] Step 250/200000; acc:  10.02; ppl: 1241.02; xent: 7.12; lr: 0.00003; 6326/7945 tok/s;    263 sec
426 [2019-05-05 01:16:28,955 INFO] Step 300/200000; acc:   9.97; ppl: 592.81; xent: 6.38; lr: 0.00004; 6286/8080 tok/s;    309 sec
427 [2019-05-05 01:17:15,858 INFO] Step 350/200000; acc:  10.47; ppl: 313.09; xent: 5.75; lr: 0.00004; 6191/8077 tok/s;    356 sec
428 [2019-05-05 01:18:02,184 INFO] Step 400/200000; acc:  17.62; ppl: 210.91; xent: 5.35; lr: 0.00005; 6213/8077 tok/s;    402 sec
429 [2019-05-05 01:18:48,703 INFO] Step 450/200000; acc:  18.31; ppl: 190.01; xent: 5.25; lr: 0.00006; 6271/7894 tok/s;    449 sec
430 [2019-05-05 01:19:35,410 INFO] Step 500/200000; acc:  20.23; ppl: 157.04; xent: 5.06; lr: 0.00006; 6136/7865 tok/s;    496 sec
431 [2019-05-05 01:20:22,061 INFO] Step 550/200000; acc:  20.17; ppl: 147.45; xent: 4.99; lr: 0.00007; 6389/7876 tok/s;    542 sec
432 [2019-05-05 01:21:11,340 INFO] Step 600/200000; acc:  20.15; ppl: 136.41; xent: 4.92; lr: 0.00007; 5975/7619 tok/s;    592 sec
433 [2019-05-05 01:21:57,888 INFO] Step 650/200000; acc:  21.36; ppl: 124.90; xent: 4.83; lr: 0.00008; 6149/7988 tok/s;    638 sec
434 [2019-05-05 01:22:44,807 INFO] Step 700/200000; acc:  23.49; ppl: 109.06; xent: 4.69; lr: 0.00009; 6287/7985 tok/s;    685 sec
435 [2019-05-05 01:23:31,635 INFO] Step 750/200000; acc:  24.36; ppl: 101.23; xent: 4.62; lr: 0.00009; 5995/7971 tok/s;    732 sec
436 [2019-05-05 01:24:18,822 INFO] Step 800/200000; acc:  25.23; ppl: 93.80; xent: 4.54; lr: 0.00010; 6105/7809 tok/s;    779 sec
437 [2019-05-05 01:25:05,297 INFO] Step 850/200000; acc:  26.33; ppl: 87.17; xent: 4.47; lr: 0.00011; 6445/7877 tok/s;    826 sec
438 [2019-05-05 01:25:51,836 INFO] Step 900/200000; acc:  27.40; ppl: 79.79; xent: 4.38; lr: 0.00011; 6421/7871 tok/s;    872 sec
439 [2019-05-05 01:26:38,635 INFO] Step 950/200000; acc:  28.88; ppl: 71.63; xent: 4.27; lr: 0.00012; 6417/7892 tok/s;    919 sec
440 [2019-05-05 01:27:25,453 INFO] Step 1000/200000; acc:  30.02; ppl: 67.02; xent: 4.21; lr: 0.00012; 6286/7852 tok/s;    966 sec
441 [2019-05-05 01:28:12,822 INFO] Step 1050/200000; acc:  31.26; ppl: 60.86; xent: 4.11; lr: 0.00013; 5993/7976 tok/s;   1013 sec
442 [2019-05-05 01:28:58,988 INFO] Step 1100/200000; acc:  32.91; ppl: 57.68; xent: 4.05; lr: 0.00014; 6368/7723 tok/s;   1059 sec
443 [2019-05-05 01:29:46,642 INFO] Step 1150/200000; acc:  33.07; ppl: 54.64; xent: 4.00; lr: 0.00014; 5901/7647 tok/s;   1107 sec
444 [2019-05-05 01:30:33,373 INFO] Step 1200/200000; acc:  35.44; ppl: 47.86; xent: 3.87; lr: 0.00015; 6099/7630 tok/s;   1154 sec
445 [2019-05-05 01:31:20,208 INFO] Step 1250/200000; acc:  35.87; ppl: 46.08; xent: 3.83; lr: 0.00015; 6133/7744 tok/s;   1200 sec
446 [2019-05-05 01:32:06,895 INFO] Step 1300/200000; acc:  35.63; ppl: 45.50; xent: 3.82; lr: 0.00016; 6226/7783 tok/s;   1247 sec
447 [2019-05-05 01:32:54,012 INFO] Step 1350/200000; acc:  37.14; ppl: 40.51; xent: 3.70; lr: 0.00017; 6073/7913 tok/s;   1294 sec
448 [2019-05-05 01:33:49,719 INFO] Loading dataset from data/train.1.pt, number of examples: 999623

And here is the two-GPU performance:

8919 [2019-05-08 23:37:08,581 INFO] Loading dataset from data/train.0.pt, number of examples: 999323
8920 [2019-05-08 23:38:02,515 INFO] Step 50/200000; acc:   4.12; ppl: 11053.49; xent: 9.31; lr: 0.00001; 9210/11318 tok/s;     64 sec
8921 [2019-05-08 23:38:53,794 INFO] Step 100/200000; acc:   9.50; ppl: 7619.20; xent: 8.94; lr: 0.00001; 11314/14351 tok/s;    116 sec
8922 [2019-05-08 23:39:47,400 INFO] Step 150/200000; acc:  10.01; ppl: 4798.87; xent: 8.48; lr: 0.00002; 10782/13932 tok/s;    169 sec
8923 [2019-05-08 23:40:39,512 INFO] Step 200/200000; acc:   9.77; ppl: 2491.54; xent: 7.82; lr: 0.00002; 11288/14067 tok/s;    221 sec
8924 [2019-05-08 23:41:31,427 INFO] Step 250/200000; acc:  10.45; ppl: 1119.18; xent: 7.02; lr: 0.00003; 11082/14314 tok/s;    273 sec
8925 [2019-05-08 23:42:24,040 INFO] Step 300/200000; acc:  11.51; ppl: 525.24; xent: 6.26; lr: 0.00004; 11232/13977 tok/s;    326 sec
8926 [2019-05-08 23:43:15,982 INFO] Step 350/200000; acc:  13.98; ppl: 291.70; xent: 5.68; lr: 0.00004; 11198/14307 tok/s;    378 sec
8927 [2019-05-08 23:44:07,772 INFO] Step 400/200000; acc:  17.87; ppl: 211.53; xent: 5.35; lr: 0.00005; 11234/14174 tok/s;    430 sec
8928 [2019-05-08 23:45:01,003 INFO] Step 450/200000; acc:  18.52; ppl: 183.83; xent: 5.21; lr: 0.00006; 10860/13699 tok/s;    483 sec
8929 [2019-05-08 23:45:53,111 INFO] Step 500/200000; acc:  19.34; ppl: 161.72; xent: 5.09; lr: 0.00006; 11300/14327 tok/s;    535 sec
8930 [2019-05-08 23:46:45,070 INFO] Step 550/200000; acc:  19.91; ppl: 145.59; xent: 4.98; lr: 0.00007; 11130/14225 tok/s;    587 sec
8931 [2019-05-08 23:47:38,305 INFO] Step 600/200000; acc:  20.73; ppl: 128.68; xent: 4.86; lr: 0.00007; 10839/13948 tok/s;    640 sec
8932 [2019-05-08 23:48:30,341 INFO] **Step 650**/200000; acc:  22.05; ppl: 118.28; xent: 4.77; lr: 0.00008; 11080/14140 tok/s;    692 sec
8933 [2019-05-08 23:49:20,753 INFO] Loading dataset from data/train.1.pt, number of examples: 999623

My questions are:

  1. Why is the time cost per 50 steps about the same with one GPU as with two GPUs?
  2. I have found that with two GPUs, train.0.pt finishes in only about half the steps it takes with one GPU (650 vs. 1350); but why is the ppl not the same at the point where train.0.pt finishes? (See the rough token count after the two log lines below.)

The last reported step for train.0.pt in each run, and its ppl:
447 [2019-05-05 01:32:54,012 INFO] Step 1350/200000; acc: 37.14; ppl: 40.51; xent: 3.70; lr: 0.00017; 6073/7913 tok/s; 1294 sec
8932 [2019-05-08 23:48:30,341 INFO] Step 650/200000; acc: 22.05; ppl: 118.28; xent: 4.77; lr: 0.00008; 11080/14140 tok/s; 692 sec
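As a rough back-of-the-envelope check (illustrative arithmetic only, using the first tok/s figure and the elapsed-seconds value reported on those two lines), both runs have processed roughly the same number of tokens at the point where train.0.pt finishes:

```python
# Rough check from the two log lines above (approximate: the reported tok/s is a
# per-interval average, so these are only ballpark totals).
one_gpu_tok_per_s, one_gpu_elapsed_s = 6073, 1294    # 1 GPU,  Step 1350
two_gpu_tok_per_s, two_gpu_elapsed_s = 11080, 692    # 2 GPUs, Step 650

print(f"1 GPU,  step 1350: ~{one_gpu_tok_per_s * one_gpu_elapsed_s:,} tokens")
print(f"2 GPUs, step  650: ~{two_gpu_tok_per_s * two_gpu_elapsed_s:,} tokens")
```

So both runs have seen a similar amount of data (~7.7-7.9M tokens) at these points, yet the reported ppl differs (40.51 vs. 118.28).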

Hi everyone, I think I found it.
The two GPUs do indeed speed up the rate of convergence:

When I use 2 GPUs, reaching the same ppl takes about half as many steps as with 1 GPU.
But the final ppl loses some convergence quality: 5.98 -> 6.47.

Hi,

This has been covered several times. When using 2 GPUs, each training step processes 2x the data: a step takes about the same wall-clock time, but it covers roughly twice as many tokens.
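A minimal back-of-the-envelope sketch of this (assuming the tok/s reported every 50 steps is a representative average for that interval; the constants below are approximate readings from the logs above, not exact values):

```python
# Approximate tokens consumed per training step, derived from a typical 50-step
# report in each log (first tok/s figure, ~seconds between consecutive reports).
steps_per_report = 50

one_gpu_tok_per_s, one_gpu_interval_s = 6200, 47     # typical 1-GPU report
two_gpu_tok_per_s, two_gpu_interval_s = 11100, 52    # typical 2-GPU report

tok_per_step_1gpu = one_gpu_tok_per_s * one_gpu_interval_s / steps_per_report
tok_per_step_2gpu = two_gpu_tok_per_s * two_gpu_interval_s / steps_per_report

print(f"1 GPU : ~{tok_per_step_1gpu:,.0f} tokens per step")
print(f"2 GPUs: ~{tok_per_step_2gpu:,.0f} tokens per step")
print(f"ratio : ~{tok_per_step_2gpu / tok_per_step_1gpu:.1f}x")
```

That is why the time per 50 steps barely changes while the tok/s roughly doubles, and why the two-GPU run reaches the end of train.0.pt in about half as many steps.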