Does converting a trained model from FP32 to FP16 improve inference performance?

I’ve trained an es->en model using the TransformerAAN architecture. It’s working as expected on CPU and GPU. I’m trying to reduce the translation latency of the model, so I converted my float32 checkpoint to float16 with onmt-convert-checkpoint. After comparing the translation latency of both, I can’t see any significant improvement. Any suggestions or advice?
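For reference, a typical conversion invocation looks something like the sketch below; the directory names are placeholders, and the exact option names can differ between OpenNMT-tf versions, so check `onmt-convert-checkpoint --help`:

```
# Convert a float32 checkpoint to float16 (paths are placeholders)
onmt-convert-checkpoint \
    --model_dir run/es_en_fp32 \
    --output_dir run/es_en_fp16 \
    --target_dtype float16
```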

We usually don’t see much improvement when using FP16 during inference, at least for seq2seq architectures.

Do you have numbers to share?

I see… well, that’s also what I thought. I can’t see much of a speedup during inference. My last test comparing the FP32 and FP16 models:

  • Model: es -> en
  • Architecture: TransformerAAN
  • Batch size: 100 sentences
  • Average length of sentence: 150 characters (or 20 tokens)

AWS instance c5.2xlarge (8 cores) + eia1.xlarge (1 GPU):

  • FP16 model inference time: 2.24 seconds
  • FP32 model inference time: 2.67 seconds

AWS instance m4.2xlarge (8 cores, no GPU):

  • FP16 model inference time: 14.9 seconds
  • FP32 model inference time: 17.6 seconds
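In case anyone wants to run the same kind of comparison, here is a minimal timing sketch assuming OpenNMT-tf inference via onmt-main; the config, checkpoint, and file names are placeholders, and the exact infer syntax depends on the OpenNMT-tf version (see `onmt-main --help`):

```
# Translate the same test set with each checkpoint and compare wall-clock time
# (paths and file names are placeholders; adapt to your run directory)
time onmt-main infer --config config_es_en.yml --auto_config \
    --checkpoint_path run/es_en_fp32 \
    --features_file test.es --predictions_file out_fp32.en

time onmt-main infer --config config_es_en.yml --auto_config \
    --checkpoint_path run/es_en_fp16 \
    --features_file test.es --predictions_file out_fp16.en
```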