I’ve trained an es->en model using the TransformerAAN architecture. It’s working as expected on CPU/GPU. I’m trying to reduce the translation latency of my model, and for that I’ve converted my float32 model checkpoint to float16 using onmt-convert-checkpoint. After comparing the translation latency of both, I can’t see any significant improvement. Any suggestions or advice?
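For context, the conversion boils down to casting every float32 weight in the checkpoint to float16. Here is a rough TensorFlow sketch of that idea (the checkpoint path is a placeholder, and this is not the actual onmt-convert-checkpoint code):

```python
import numpy as np
import tensorflow as tf

# Placeholder path to an existing OpenNMT-tf checkpoint.
reader = tf.train.load_checkpoint("run/baseline/ckpt-50000")

fp32_bytes = fp16_bytes = 0
for name, dtype in reader.get_variable_to_dtype_map().items():
    if dtype != tf.float32:
        continue
    weights = reader.get_tensor(name)   # numpy float32 array
    half = weights.astype(np.float16)   # the conversion is essentially this cast
    fp32_bytes += weights.nbytes
    fp16_bytes += half.nbytes

print(f"float32 weights: {fp32_bytes / 1e6:.1f} MB")
print(f"float16 weights: {fp16_bytes / 1e6:.1f} MB")  # roughly half the size
```

On disk this roughly halves the checkpoint size.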
We usually don’t see much improvement when using FP16 during inference, at least for seq2seq architectures.
Do you have numbers to share?
I see… well, that’s also what I thought. I don’t see much of a speedup during inference. Here is my last test comparing the FP32 and FP16 models:
- Model: es -> en
- Architecture: TransformerAAN
- Batch size: 100 sentences
- Average sentence length: 150 characters (~20 tokens)
- AWS instance: c5.2xlarge (8 cores) + eia1.xlarge (1 GPU)
  - FP16 model inference time: 2.24 seconds
  - FP32 model inference time: 2.67 seconds
- AWS instance: m4.2xlarge (8 cores, no GPU)
  - FP16 model inference time: 14.9 seconds
  - FP32 model inference time: 17.6 seconds
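For reference, a straightforward way to measure this kind of latency is sketched below; translate_fp32 / translate_fp16 and the batch variable are placeholder names, not OpenNMT API calls:

```python
import time

def time_translation(translate_fn, batch, n_runs=5):
    """Average wall-clock latency of translating one batch over several runs."""
    translate_fn(batch)  # warm-up run so lazy initialization is not measured
    start = time.perf_counter()
    for _ in range(n_runs):
        translate_fn(batch)
    return (time.perf_counter() - start) / n_runs

# Hypothetical usage: translate_fp32 / translate_fp16 wrap the two loaded models.
# fp32_latency = time_translation(translate_fp32, batch_of_100_sentences)
# fp16_latency = time_translation(translate_fp16, batch_of_100_sentences)
# print(f"FP16 speedup: {fp32_latency / fp16_latency:.2f}x")
```

With the numbers above, that ratio works out to only about 1.2x on both instance types.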