Does converting a trained model from FP32 to FP16 improve inference performance?

I’ve trained an es->en model using the TransformerAAN architecture. It’s working as expected on CPU and GPU. I’m trying to reduce the translation latency of the model, so I converted my float32 checkpoint to float16 with onmt-convert-checkpoint. After comparing the translation latency of both, I can’t see any significant improvement. Any suggestions or advice?
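For reference, a typical conversion invocation looks something like the sketch below; the directory names are placeholders, and the exact option names can differ between OpenNMT-tf versions, so check `onmt-convert-checkpoint --help`:

```
# Convert a float32 checkpoint to float16 (paths are placeholders)
onmt-convert-checkpoint \
    --model_dir run/es_en_fp32 \
    --output_dir run/es_en_fp16 \
    --target_dtype float16
```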

We usually don’t see much improvement when using FP16 during inference, at least for seq2seq architectures.

Do you have numbers to share?

I see… well, that’s also what I thought. I can’t see much of a speedup during inference. My last test comparing the FP32 and FP16 models:

  • Model: es -> en
  • Architecture: TransformerAAN
  • Batch size: 100 sentences
  • Average length of sentence: 150 characters (or 20 tokens)

AWS instance c5.2xlarge (8 cores) + eia1.xlarge (1 GPU):

  • FP16 model inference time: 2.24 seconds
  • FP32 model inference time: 2.67 seconds

AWS instance m4.2xlarge (8 cores, no GPU):

  • FP16 model inference time: 14.9 seconds
  • FP32 model inference time: 17.6 seconds
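In case anyone wants to run the same kind of comparison, here is a minimal timing sketch assuming OpenNMT-tf inference via onmt-main; the config, checkpoint, and file names are placeholders, and the exact infer syntax depends on the OpenNMT-tf version (see `onmt-main --help`):

```
# Translate the same test set with each checkpoint and compare wall-clock time
# (paths and file names are placeholders; adapt to your run directory)
time onmt-main infer --config config_es_en.yml --auto_config \
    --checkpoint_path run/es_en_fp32 \
    --features_file test.es --predictions_file out_fp32.en

time onmt-main infer --config config_es_en.yml --auto_config \
    --checkpoint_path run/es_en_fp16 \
    --features_file test.es --predictions_file out_fp16.en
```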