Inference latency comparison with FasterTransformer

FasterTransformer appears to use the OpenNMT pretrained model to benchmark their end-to-end translation speed. On a V100 GPU with beam size 1 and batch size 1, they report about 800 tokens per second in float32 and 1000 in float16. When we ran CTranslate2 with the same settings, we got 700 tokens per second in float32 and 550 in float16.
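For reference, "tokens per second" here is total generated tokens divided by wall-clock decode time. The sketch below shows that measurement with a stand-in `decode` callable (hypothetical — not the actual CTranslate2 or FasterTransformer API):

```python
import time

def throughput_tokens_per_sec(decode, batches):
    """End-to-end throughput: total generated tokens / wall-clock time.

    `decode` is a stand-in for the translation call (hypothetical,
    not either library's real API); it takes one input batch and
    returns the list of generated output tokens.
    """
    start = time.perf_counter()
    total_tokens = 0
    for batch in batches:
        total_tokens += len(decode(batch))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed
```

With batch size 1, each call to `decode` processes a single sentence, so per-call overhead (kernel launches, host/device transfers) weighs more heavily than at larger batch sizes.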

I have two questions.

  1. Any idea on the float32 latency gap?
  2. Any idea why float16 in CTranslate2 adds overhead at batch size 1 while FasterTransformer shows a gain?

(Tokens per second is measuring throughput, not latency.)

  1. It’s difficult to tell if the numbers are comparable. To compare speed fairly, all other parameters should be controlled: same hardware, same CUDA version, same test file, same output, same memory usage, etc. In any case the gap does not appear very big, and we can probably make it smaller with additional tuning for batch_size=1.

  2. I don’t think we ever tested batch_size=1 with FP16, so there is probably some tuning to do. In particular, I’m unclear whether it is worth enabling Tensor Cores at that size (Tensor Cores require all dimensions to be a multiple of 8).
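To illustrate the multiple-of-8 constraint: a dimension that is not already a multiple of 8 has to be padded up before Tensor Cores can be used, and the padded portion is wasted work. A generic sketch (not CTranslate2's actual padding logic):

```python
def pad_to_multiple(dim, multiple=8):
    """Round `dim` up to the next multiple of `multiple` (8 for FP16 Tensor Cores)."""
    return ((dim + multiple - 1) // multiple) * multiple

# With batch_size=1, the batch dimension of the GEMM would be padded
# from 1 up to 8, so 7/8 of that dimension is wasted computation --
# one plausible reason FP16 does not pay off at very small batches.
```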

I added FasterTransformer to our GPU benchmark since it supports OpenNMT models. It has very good FP16 performance, which is expected since the implementation comes directly from NVIDIA. However, the tested configuration uses much more memory than CTranslate2.