Multi-GPU is slower than single GPU

I use 4 x GTX 1080 Ti 11 GB GPUs, but I found training is even slower than using a single GPU.
My Torch build is LUA52, and I trained without NCCL, in synchronous parallel mode. The log looks like this:

[11/17/17 23:43:33 INFO] Using GPU(s): 1, 2, 3, 4
[11/17/17 23:43:33 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
[11/17/17 23:43:34 WARNING] For improved efficiency with multiple GPUs, consider installing nccl
[11/17/17 23:43:34 INFO] Training Sequence to Sequence with Attention model…
[11/17/17 23:43:34 INFO] Loading data from 'data/patent_train_20171117/demo-train.t7'…
[11/18/17 01:01:21 INFO] * vocabulary size: source = 532157; target = 331638
[11/18/17 01:01:21 INFO] * additional features: source = 0; target = 0
[11/18/17 01:01:21 INFO] * maximum sequence length: source = 80; target = 81
[11/18/17 01:01:21 INFO] * number of training sentences: 40362236
[11/18/17 01:01:21 INFO] * number of batches: 630701
[11/18/17 01:01:21 INFO] - source sequence lengths: equal
[11/18/17 01:01:21 INFO] - maximum size: 64
[11/18/17 01:01:21 INFO] - average size: 64.00
[11/18/17 01:01:21 INFO] - capacity: 100.00%
[11/18/17 01:01:21 INFO] Building model…
[11/18/17 01:01:21 INFO] * Encoder:
[11/18/17 01:02:10 INFO] - word embeddings size: 500
[11/18/17 01:02:10 INFO] - type: unidirectional RNN
[11/18/17 01:02:10 INFO] - structure: cell = LSTM; layers = 2; rnn_size = 500; dropout = 0.3 (naive)
[11/18/17 01:02:11 INFO] * Decoder:
[11/18/17 01:02:23 INFO] - word embeddings size: 500
[11/18/17 01:02:26 INFO] - attention: global (general)
[11/18/17 01:02:26 INFO] - structure: cell = LSTM; layers = 2; rnn_size = 500; dropout = 0.3 (naive)
[11/18/17 01:02:26 INFO] * Bridge: copy
[11/18/17 01:02:48 INFO] Initializing parameters…
[11/18/17 01:03:24 INFO] * number of parameters: 607814138
[11/18/17 01:03:24 INFO] Preparing memory optimization…
[11/18/17 01:03:26 INFO] * sharing 69% of output/gradInput tensors memory between clones
[11/18/17 01:05:19 INFO] Start training from epoch 1 to 13…
[11/18/17 01:05:19 INFO]
[11/18/17 01:13:28 INFO] Epoch 1 ; Iteration 50/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1111 ; Perplexity 539751.62
[11/18/17 01:17:04 INFO] Epoch 1 ; Iteration 100/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1384 ; Perplexity 52198.92
[11/18/17 01:20:44 INFO] Epoch 1 ; Iteration 150/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1447 ; Perplexity 11289.72
[11/18/17 01:24:18 INFO] Epoch 1 ; Iteration 200/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1494 ; Perplexity 2725.05
[11/18/17 01:27:50 INFO] Epoch 1 ; Iteration 250/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1444 ; Perplexity 1223.41
[11/18/17 01:31:24 INFO] Epoch 1 ; Iteration 300/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1439 ; Perplexity 791.47
[11/18/17 01:34:58 INFO] Epoch 1 ; Iteration 350/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1525 ; Perplexity 518.24
[11/18/17 01:38:31 INFO] Epoch 1 ; Iteration 400/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1481 ; Perplexity 385.67
[11/18/17 01:42:07 INFO] Epoch 1 ; Iteration 450/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1530 ; Perplexity 299.30
[11/18/17 01:45:41 INFO] Epoch 1 ; Iteration 500/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1551 ; Perplexity 259.12
[11/18/17 01:49:12 INFO] Epoch 1 ; Iteration 550/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1433 ; Perplexity 215.22
[11/18/17 01:52:45 INFO] Epoch 1 ; Iteration 600/157676 ; Optim SGD LR 1.000000 ; Source tokens/s 1479 ; Perplexity 189.00

In asynchronous parallel training mode, the log looks like this:

[11/19/17 18:11:51 INFO] Using GPU(s): 1, 2, 3, 4
[11/19/17 18:11:51 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
[11/19/17 18:11:52 INFO] Training Sequence to Sequence with Attention model…
[11/19/17 18:11:52 INFO] Loading data from 'data/patent_train_20171117/demo-train.t7'…
[11/19/17 19:33:13 INFO] * vocabulary size: source = 532157; target = 331638
[11/19/17 19:33:13 INFO] * additional features: source = 0; target = 0
[11/19/17 19:33:13 INFO] * maximum sequence length: source = 80; target = 81
[11/19/17 19:33:13 INFO] * number of training sentences: 40362236
[11/19/17 19:33:13 INFO] * number of batches: 630701
[11/19/17 19:33:13 INFO] - source sequence lengths: equal
[11/19/17 19:33:13 INFO] - maximum size: 64
[11/19/17 19:33:13 INFO] - average size: 64.00
[11/19/17 19:33:13 INFO] - capacity: 100.00%
[11/19/17 19:33:13 INFO] Building model…
[11/19/17 19:33:13 INFO] * Encoder:
[11/19/17 19:34:04 INFO] - word embeddings size: 500
[11/19/17 19:34:04 INFO] - type: unidirectional RNN
[11/19/17 19:34:04 INFO] - structure: cell = LSTM; layers = 2; rnn_size = 500; dropout = 0.3 (naive)
[11/19/17 19:34:05 INFO] * Decoder:
[11/19/17 19:34:18 INFO] - word embeddings size: 500
[11/19/17 19:34:21 INFO] - attention: global (general)
[11/19/17 19:34:21 INFO] - structure: cell = LSTM; layers = 2; rnn_size = 500; dropout = 0.3 (naive)
[11/19/17 19:34:21 INFO] * Bridge: copy
[11/19/17 19:34:47 INFO] Initializing parameters…
[11/19/17 19:35:14 INFO] * number of parameters: 607814138
[11/19/17 19:35:15 INFO] Preparing memory optimization…
[11/19/17 19:35:17 INFO] * sharing 69% of output/gradInput tensors memory between clones
[11/19/17 19:37:14 INFO] Start training from epoch 1 to 13…
[11/19/17 19:37:14 INFO]
[11/19/17 20:01:55 INFO] Epoch 1 ; Iteration 500/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 600 ; Perplexity 2793.52
[11/19/17 20:21:57 INFO] Epoch 1 ; Iteration 1000/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 650 ; Perplexity 177.09
[11/19/17 20:32:46 INFO] Epoch 1 ; Iteration 1500/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 1223 ; Perplexity 258.51
[11/19/17 20:42:44 INFO] Epoch 1 ; Iteration 2000/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 1367 ; Perplexity 135.64
[11/19/17 20:52:41 INFO] Epoch 1 ; Iteration 2500/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 1320 ; Perplexity 94.57
[11/19/17 21:02:52 INFO] Epoch 1 ; Iteration 3000/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 1235 ; Perplexity 66.87
[11/19/17 21:12:57 INFO] Epoch 1 ; Iteration 3500/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 1281 ; Perplexity 50.77
[11/19/17 21:22:58 INFO] Epoch 1 ; Iteration 4000/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 1338 ; Perplexity 42.55
[11/19/17 21:33:14 INFO] Epoch 1 ; Iteration 4500/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 1242 ; Perplexity 36.17
[11/19/17 21:43:14 INFO] Epoch 1 ; Iteration 5000/630701 ; Optim SGD LR 1.000000 ; Source tokens/s 1323 ; Perplexity 32.14

But when I use only 1 GPU, the speed can reach 3000 tokens/s.
So what could the problem be?

Thanks.

Sorry, I made a mistake.
Actually, the single-GPU training speed is about 1100 tokens/s, which is slightly slower than multi-GPU.
But the slow multi-GPU speed is still an issue.
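To put numbers on this: a quick way to quantify how far multi-GPU training is from linear scaling (a minimal sketch; the ~1450 tok/s and ~1100 tok/s figures are taken from the logs above):

```python
def scaling_efficiency(multi_tps: float, single_tps: float, n_gpus: int) -> float:
    """Fraction of the ideal (linear) speedup actually achieved."""
    return multi_tps / (single_tps * n_gpus)

# Sync-mode throughput from the log (~1450 tok/s on 4 GPUs)
# vs. ~1100 tok/s on a single GPU:
eff = scaling_efficiency(1450, 1100, 4)
print(f"{eff:.0%}")  # 33% -- far from linear scaling
```

A value near 100% would mean near-perfect parallelism; 33% confirms most of the hardware is idle.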

My preprocessed data file is about 15 GB and sentence lengths are under 80 tokens.
During training, system memory usage is about 30/32 GB, and each of the 4 GPUs uses roughly 10/11 GB.
Could increasing batch_size speed up training in my situation?

From my observation, it seems the 4 GPUs are not running in parallel; only one works at a time.
Thanks for the reply.

Can you post the log file for a single GPU? And if you can, watch the GPU usage. I haven't run the async mode, but in sync mode the speed is roughly proportional to the number of GPUs. You should also consider installing NCCL.
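For watching the GPU usage, something like the following works on machines with the NVIDIA driver installed (a hypothetical monitoring command; exact flags may vary by driver version):

```shell
# Sample index, utilization and memory use of every GPU once per second.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```

If only one GPU at a time shows non-zero utilization, the workers are effectively serialized.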

Hello @thanhleha ,

Would you mind sharing how many GPUs you tested sync training on, the type of hardware, and the size of the model?

I am quite surprised by the proportional speedup you report.

thanks.

Hi @vince62s,

Sometimes I use 2 GPUs, sometimes 3 or even 4. We have a GPU node with 6 GeForce GTX 1080 Ti cards (11172 MiB of GPU RAM each) and 2 Pascal TITAN X cards, and I always use the 1080 Ti GPUs. I have run many models in multi-GPU sync mode. For example, the attached log shows how I trained a big model with 3 GPUs: both the total training time for 4 epochs and the number of source tokens processed per second are about 3 times better than training the same model on the same data with 1 GPU. Note the suggestions from @guillaumekln. I always choose the minibatch size to be as big as possible. I can retrain the network with 1 GPU for some iterations and post the results here for comparison. Stay tuned!

[08/07/17 22:59:34 INFO] Using GPU(s): 1, 2, 3
[08/07/17 22:59:34 INFO] Training Sequence to Sequence with Attention model with Multi GPU(s)
[08/07/17 22:59:34 INFO] Loading data from 'saves/sr-tg/tensor-50-train.t7'…
[08/07/17 23:02:14 INFO] * vocabulary size: source = 90004; target = 118631
[08/07/17 23:02:14 INFO] * additional features: source = 0; target = 0
[08/07/17 23:02:14 INFO] * maximum sequence length: source = 50; target = 51
[08/07/17 23:02:14 INFO] * number of training sentences: 4024154
[08/07/17 23:02:14 INFO] * maximum batch size: 100
[08/07/17 23:02:14 INFO] Building model…
[08/07/17 23:02:39 INFO] * Input feeding using concatenation
[08/07/17 23:02:43 INFO] * Input feeding using concatenation
[08/07/17 23:02:47 INFO] * Input feeding using concatenation
[08/07/17 23:02:54 INFO] Initializing parameters…
[08/07/17 23:03:02 INFO] * number of parameters: 369874791
[08/07/17 23:03:02 INFO] Preparing memory optimization…
[08/07/17 23:03:03 INFO] * sharing 69% of output/gradInput tensors memory between clones
[08/07/17 23:04:52 INFO] Initial Validation BLEU score: 0.00
[08/07/17 23:04:52 INFO] Start training…
[08/07/17 23:04:52 INFO]
[08/07/17 23:07:35 INFO] Epoch 1 ; Iteration 100/13421 ; Learning rate 0.0010 ; Source tokens/s 4092 ; Perplexity 1480.93
[08/07/17 23:09:52 INFO] Epoch 1 ; Iteration 200/13421 ; Learning rate 0.0010 ; Source tokens/s 5034 ; Perplexity 427.19
[08/07/17 23:12:05 INFO] Epoch 1 ; Iteration 300/13421 ; Learning rate 0.0010 ; Source tokens/s 5145 ; Perplexity 290.74
[08/07/17 23:14:19 INFO] Epoch 1 ; Iteration 400/13421 ; Learning rate 0.0010 ; Source tokens/s 5117 ; Perplexity 225.81
[08/07/17 23:16:38 INFO] Epoch 1 ; Iteration 500/13421 ; Learning rate 0.0010 ; Source tokens/s 5347 ; Perplexity 199.56
[08/07/17 23:18:55 INFO] Epoch 1 ; Iteration 600/13421 ; Learning rate 0.0010 ; Source tokens/s 5242 ; Perplexity 161.13
[08/07/17 23:21:09 INFO] Epoch 1 ; Iteration 700/13421 ; Learning rate 0.0010 ; Source tokens/s 5121 ; Perplexity 134.56
[08/07/17 23:23:26 INFO] Epoch 1 ; Iteration 800/13421 ; Learning rate 0.0010 ; Source tokens/s 5226 ; Perplexity 124.30
[08/07/17 23:25:38 INFO] Epoch 1 ; Iteration 900/13421 ; Learning rate 0.0010 ; Source tokens/s 5136 ; Perplexity 109.24
[08/07/17 23:27:54 INFO] Epoch 1 ; Iteration 1000/13421 ; Learning rate 0.0010 ; Source tokens/s 5187 ; Perplexity 103.24
[08/07/17 23:30:12 INFO] Epoch 1 ; Iteration 1100/13421 ; Learning rate 0.0010 ; Source tokens/s 5211 ; Perplexity 94.37
[08/07/17 23:32:31 INFO] Epoch 1 ; Iteration 1200/13421 ; Learning rate 0.0010 ; Source tokens/s 5243 ; Perplexity 79.40

Is this a 2x500 model?
What motherboard/processor?

@saz0568 in your case, the vocabulary sizes are huge; 100K is already on the high side.

@thanhleha @vince62s
Thank you all for your suggestion.
From my observation, my model is quite big: over 40,000,000 sentences and an 800,000-word vocabulary (pruned from over 2,500,000 words). This may cause the training program to occupy too much local memory: the Lua process used 28/32 GB of RAM plus 8 GB of swap space.
During training, the GPUs need to fetch source data from RAM, but when RAM is exhausted, the training data spills into swap. The swap space (an SSD partition) is quite slow compared with RAM, which limits the overall training speed.
It seems the only solution is to add more RAM… and that's my guess for the speed issue.
By the way, I didn't install NCCL; I'll try it later.
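One way to check this guess while training runs is to sample RAM and swap usage. A minimal Linux-only sketch (reads `/proc/meminfo`; `MemAvailable` assumes kernel >= 3.14, so we fall back to `MemFree`):

```python
def read_meminfo(path="/proc/meminfo"):
    """Parse /proc/meminfo into a dict of values in kB."""
    info = {}
    with open(path) as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key.strip()] = int(rest.split()[0])
    return info

mem = read_meminfo()
avail_kb = mem.get("MemAvailable", mem["MemFree"])  # fallback for old kernels
swap_used_kb = mem["SwapTotal"] - mem["SwapFree"]
print(f"RAM used: {(mem['MemTotal'] - avail_kb) / 1e6:.1f} GB "
      f"of {mem['MemTotal'] / 1e6:.1f} GB; swap used: {swap_used_kb / 1e6:.1f} GB")
```

If swap usage climbs while tokens/s drops, the data pipeline, not the GPUs, is the bottleneck.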

Hi

Not sure if I should post this elsewhere, so… anyway. As mentioned here, I tried a non-trivial model, namely the Transformer model (http://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model-do-you-support-multi-gpu). Not sure whether it counts as a complex model, but here are the results of running up to the first checkpoint on a 2.1M-line CA-ES corpus.

2 GPU

.
.
[2019-04-13 23:44:34,670 INFO] Step 9950/10000; acc:  94.68; ppl:  1.30; xent: 0.26; lr: 0.00089; 9658/9441 tok/s;  15582 sec
[2019-04-13 23:45:51,517 INFO] Step 10000/10000; acc:  94.47; ppl:  1.31; xent: 0.27; lr: 0.00088; 9676/9412 tok/s;  15659 sec
[2019-04-13 23:45:51,569 INFO] Loading dataset from CAES.data.valid.0.pt, number of examples: 4997
[2019-04-13 23:46:22,828 INFO] Validation perplexity: 1.97683
[2019-04-13 23:46:22,828 INFO] Validation accuracy: 88.8055
[2019-04-13 23:46:22,829 INFO] Saving checkpoint CAES.model_step_10000.pt

Notice around 9500 tok/s, and 15659 sec to save the step-10000 checkpoint.

Now 1 GPU

.
.
[2019-04-14 03:44:14,151 INFO] Step 10000/200000; acc:  93.56; ppl:  1.40; xent: 0.33; lr: 0.00088; 6060/5964 tok/s;  12296 sec
[2019-04-14 03:44:14,186 INFO] Loading dataset from CAES.data.valid.0.pt, number of examples: 4997
[2019-04-14 03:44:39,092 INFO] Validation perplexity: 2.17073
[2019-04-14 03:44:39,092 INFO] Validation accuracy: 87.4099
[2019-04-14 03:44:39,093 INFO] Saving checkpoint CAES.model_step_10000.pt

Notice 6000 tok/s and 12296 sec

As you can see, tok/s rises from 6000 to 9500, perplexity improves from 2.17 to 1.97 (?), and accuracy from 87.4 to 88.8 (?), but the total time increases from 12296 to 15659 sec (??).

I expected only a time reduction, which is not what happens. I assume that with 2 GPUs we would need fewer steps, since the ppl is already lower. Is that the idea? How many fewer steps?

Thanks!
Miguel

When running on 2 GPUs you double the effective batch size, so in fact you only need to run 5000 steps compared to 10000 steps on 1 GPU.
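In other words, what matters is the number of training examples consumed, not the step count. A tiny sketch of that equivalence (assuming the per-GPU batch size stays fixed):

```python
def equivalent_steps(single_gpu_steps: int, n_gpus: int) -> int:
    """Steps needed on n_gpus GPUs to consume the same amount of data
    as single_gpu_steps steps on 1 GPU, at a fixed per-GPU batch size."""
    return single_gpu_steps // n_gpus

print(equivalent_steps(10000, 2))  # 5000
```

So the right comparison is step 10000 on 1 GPU versus step 5000 on 2 GPUs, at which point the wall-clock advantage of the second GPU becomes visible.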


Hi!
This makes sense. Thanks for your fast answer.
Have a nice day!