Hi,
I ran into a problem when training with the -fp16 option.
My model is trained on a Tesla P100, so I expected training to be about 2x faster with fp16, but I cannot reproduce that result.
If I train my model without -fp16:
[01/08/18 19:31:22 INFO] Epoch 1 ; Iteration 450/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3227 ; Perplexity 26296.91
[01/08/18 19:31:48 INFO] Epoch 1 ; Iteration 500/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3032 ; Perplexity 43565.00
[01/08/18 19:32:10 INFO] Epoch 1 ; Iteration 550/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3251 ; Perplexity 30608.28
[01/08/18 19:32:34 INFO] Epoch 1 ; Iteration 600/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3046 ; Perplexity 27288.81
[01/08/18 19:32:58 INFO] Epoch 1 ; Iteration 650/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3226 ; Perplexity 9511.65
[01/08/18 19:33:22 INFO] Epoch 1 ; Iteration 700/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3067 ; Perplexity 3838.16
[01/08/18 19:33:46 INFO] Epoch 1 ; Iteration 750/26079 ; Optim SGD LR 1.000000 ; Source tokens/s 3029 ; Perplexity 1993.95
With -fp16:
[01/11/18 16:17:24 INFO] Epoch 1 ; Iteration 450/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1767 ; Perplexity nan
[01/11/18 16:18:10 INFO] Epoch 1 ; Iteration 500/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1769 ; Perplexity nan
[01/11/18 16:18:57 INFO] Epoch 1 ; Iteration 550/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1745 ; Perplexity nan
[01/11/18 16:19:44 INFO] Epoch 1 ; Iteration 600/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1751 ; Perplexity nan
[01/11/18 16:20:32 INFO] Epoch 1 ; Iteration 650/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1774 ; Perplexity nan
[01/11/18 16:21:23 INFO] Epoch 1 ; Iteration 700/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1752 ; Perplexity nan
[01/11/18 16:22:12 INFO] Epoch 1 ; Iteration 750/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1777 ; Perplexity nan
[01/11/18 16:22:59 INFO] Epoch 1 ; Iteration 800/26080 ; Optim SGD LR 1.000000 ; Source tokens/s 1794 ; Perplexity nan
With fp16, 50 iterations take 47 s, while without fp16 they take only 24 s, so fp16 is actually about 2x slower here.
Also, with fp16 the perplexity is NaN from the start, which is strange.
Has anyone tested fp16? Did I do something wrong?
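While looking into the NaN, I did a quick numpy check of float16 limits (just an illustration of half precision, not OpenNMT code). The perplexities in my fp32 log (~26000-43000) are already close to the float16 maximum of 65504, so I suspect something in the loss/perplexity computation overflows to inf and then turns into NaN:

```python
import numpy as np

# float16 has a maximum representable value of 65504, not far above
# the perplexities seen early in training.
print(np.finfo(np.float16).max)

# Anything larger simply overflows to inf:
print(np.float16(70000.0))                       # inf

# and inf arithmetic produces NaN, which then propagates:
print(np.float16(np.inf) - np.float16(np.inf))   # nan
```

If that is the cause, I guess some form of loss scaling, or accumulating the loss/perplexity in float32, would be needed, but I am not sure how to do that here.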
Thanks:)