Hi,
I ran into a problem when training with the -fp16 option.
My model is trained on a Tesla P100, so I expected training to be about 2x faster with fp16, but I cannot reproduce that result.
If I train my model without -fp16:
[01/08/18 19:31:22 INFO] Epoch 1 ; Iteration 450/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3227 ; Perplexity 26296.91
[01/08/18 19:31:48 INFO] Epoch 1 ; Iteration 500/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3032 ; Perplexity 43565.00
[01/08/18 19:32:10 INFO] Epoch 1 ; Iteration 550/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3251 ; Perplexity 30608.28
[01/08/18 19:32:34 INFO] Epoch 1 ; Iteration 600/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3046 ; Perplexity 27288.81
[01/08/18 19:32:58 INFO] Epoch 1 ; Iteration 650/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3226 ; Perplexity 9511.65
[01/08/18 19:33:22 INFO] Epoch 1 ; Iteration 700/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3067 ; Perplexity 3838.16
[01/08/18 19:33:46 INFO] Epoch 1 ; Iteration 750/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3029 ; Perplexity 1993.95
With -fp16:
[01/11/18 16:17:24 INFO] Epoch 1 ; Iteration 450/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1767 ; Perplexity nan
[01/11/18 16:18:10 INFO] Epoch 1 ; Iteration 500/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1769 ; Perplexity nan
[01/11/18 16:18:57 INFO] Epoch 1 ; Iteration 550/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1745 ; Perplexity nan
[01/11/18 16:19:44 INFO] Epoch 1 ; Iteration 600/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1751 ; Perplexity nan
[01/11/18 16:20:32 INFO] Epoch 1 ; Iteration 650/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1774 ; Perplexity nan
[01/11/18 16:21:23 INFO] Epoch 1 ; Iteration 700/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1752 ; Perplexity nan
[01/11/18 16:22:12 INFO] Epoch 1 ; Iteration 750/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1777 ; Perplexity nan
[01/11/18 16:22:59 INFO] Epoch 1 ; Iteration 800/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1794 ; Perplexity nan
With fp16, 50 iterations take 47 s, while without fp16 they take only 24 s, so fp16 is actually about 2x slower here.
Also, with fp16 the perplexity is NaN from the start, which is strange.
Has anyone tested fp16? Did I do something wrong?
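While looking into the NaN, I did a quick numpy check of float16 limits (just an illustration of half precision, not OpenNMT code). The perplexities in my fp32 log (~26000-43000) are already close to the float16 maximum of 65504, so I suspect something in the loss/perplexity computation overflows to inf and then turns into NaN:

```python
import numpy as np

# float16 has a maximum representable value of 65504, not far above
# the perplexities seen early in training.
print(np.finfo(np.float16).max)

# Anything larger simply overflows to inf:
print(np.float16(70000.0))                       # inf

# and inf arithmetic produces NaN, which then propagates:
print(np.float16(np.inf) - np.float16(np.inf))   # nan
```

If that is the cause, I guess some form of loss scaling, or accumulating the loss/perplexity in float32, would be needed, but I am not sure how to do that here.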
Thanks:)