Inference latency comparison with FasterTransformer

Guoli · May 5, 2021, 10:08pm

NVIDIA/FasterTransformer/blob/main/docs/decoder_guide.md

# FasterTransformer Decoder

The FasterTransformer Decoder contains the transformer decoder block, whole decoding progress, and GPT model.


FasterTransformer seems to use the OpenNMT pretrained model to benchmark its end-to-end translation speed. On a V100 GPU with beam size 1 and batch size 1, they report 800 tokens per second in float32 and 1000 in float16. When we ran CTranslate2 with the same settings, we got 700 tokens per second in float32 and 550 in float16.

I have two questions:

1. Any idea what explains the float32 gap?
2. Any idea why float16 in CTranslate2 adds overhead at batch size 1, while FasterTransformer shows a gain?
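
To make the two gaps concrete, here is a minimal sketch that only does arithmetic on the four figures quoted above. The helper names `fp16_speedup` and `relative_gap` are made up for illustration, and nothing here is measured.

```python
# Tokens per second quoted in the post above (V100, beam size 1, batch size 1).
reported = {
    "FasterTransformer": {"float32": 800.0, "float16": 1000.0},
    "CTranslate2": {"float32": 700.0, "float16": 550.0},
}


def fp16_speedup(tps):
    """Ratio of float16 to float32 throughput; below 1.0 means float16 is slower."""
    return tps["float16"] / tps["float32"]


def relative_gap(baseline, other):
    """Fraction by which `other` falls short of `baseline`."""
    return (baseline - other) / baseline


ft = reported["FasterTransformer"]
ct2 = reported["CTranslate2"]

print(f"FasterTransformer FP16/FP32: {fp16_speedup(ft):.2f}x")
print(f"CTranslate2 FP16/FP32:       {fp16_speedup(ct2):.2f}x")
print(f"float32 gap: {relative_gap(ft['float32'], ct2['float32']):.1%}")
print(f"float16 gap: {relative_gap(ft['float16'], ct2['float16']):.1%}")
```

With these figures, float16 is 1.25x faster than float32 for FasterTransformer but about 0.79x for CTranslate2, and CTranslate2 trails by 12.5% in float32 and 45% in float16.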

guillaumekln · May 6, 2021, 7:02am

(Tokens per second is measuring throughput, not latency.)
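
As a rough illustration of the distinction, the sketch below times single-sentence calls and reports both numbers. `measure` and `fake_translate` are hypothetical names; `fake_translate` is a placeholder that just sleeps and is not a CTranslate2 or FasterTransformer call.

```python
import time


def measure(translate, sentences):
    """Time each single-sentence call.

    Returns (mean latency in seconds per request, output tokens per second).
    """
    latencies = []
    total_tokens = 0
    for sentence in sentences:
        start = time.perf_counter()
        output_tokens = translate(sentence)
        latencies.append(time.perf_counter() - start)
        total_tokens += len(output_tokens)
    total_time = sum(latencies)
    mean_latency = total_time / len(latencies)  # time per request
    throughput = total_tokens / total_time      # tokens per unit of time
    return mean_latency, throughput


def fake_translate(tokens):
    # Placeholder standing in for a real translator: about 1 ms per token.
    time.sleep(0.001 * len(tokens))
    return list(tokens)


if __name__ == "__main__":
    batch = [["Hello", "world"], ["How", "are", "you", "?"]]
    latency, tps = measure(fake_translate, batch)
    print(f"mean latency: {latency * 1000:.2f} ms, throughput: {tps:.0f} tokens/s")
```

At batch size 1 the two are closely tied, but with larger batches throughput can go up while the latency of each request also goes up.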

It’s difficult to tell if the numbers are comparable. If we just compare the speed, then all other parameters should be controlled: same hardware, same CUDA version, same test file, same output, same memory usage, etc. In any case the gap does not appear very big, and we can probably make it smaller with additional tuning for batch_size=1.
I don’t think we ever tested batch_size=1 for FP16, so there is probably some tuning to do. In particular, I’m not sure it is worth trying to enable Tensor Cores at that size (Tensor Cores require all dimensions to be a multiple of 8).
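
To illustrate the multiple-of-8 constraint mentioned above, here is a small sketch that checks matrix-multiplication dimensions and computes the padding needed. The function names are hypothetical, and the hidden size of 512 is an assumption for the example, not a value taken from either benchmark.

```python
def round_up(value, multiple=8):
    """Smallest multiple of `multiple` that is >= value."""
    return ((value + multiple - 1) // multiple) * multiple


def gemm_dims_aligned(m, n, k, multiple=8):
    """True if every dimension of an (m x k) @ (k x n) product is a multiple of `multiple`."""
    return all(d % multiple == 0 for d in (m, n, k))


# One decoding step with batch size 1 and beam size 1: a single row of activations.
batch_size, beam_size = 1, 1
hidden_size = 512  # assumed model dimension, for illustration only

m = batch_size * beam_size
print(gemm_dims_aligned(m, hidden_size, hidden_size))            # m = 1 is not a multiple of 8
print(round_up(m))                                               # rows after padding
print(gemm_dims_aligned(round_up(m), hidden_size, hidden_size))  # aligned once padded
```

Under that rule, a batch of 1 would have to be padded to 8 rows, so 7 of the 8 rows are wasted work. That is one reason the benefit at this size is uncertain.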

guillaumekln · October 20, 2021, 12:35pm

I added FasterTransformer to our GPU benchmark since it supports OpenNMT models. It has very good FP16 performance, which is expected since the implementation comes directly from NVIDIA. However, the tested configuration uses a lot more memory than CTranslate2.