Some inference benchmarks

Here are a few benchmarks for Transformer inference with CTranslate2 vs. OpenNMT-py. This is a first batch of results; this post might be updated.
Inference for CTranslate2 is run with the CLI interface (`ctranslate2/bin/translate`).
Inference for OpenNMT-py is run with the `onmt_translate` entry point.

Speeds are in target tokens per second.
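For clarity, here is a minimal sketch of how such a metric can be computed (my interpretation of "target tokens per second", not the exact benchmark script): the total number of tokens in the generated translations divided by the wall-clock time of the run.

```python
# Sketch: throughput as total generated target tokens / elapsed wall time.
def target_tokens_per_second(translations, elapsed_seconds):
    """translations: list of tokenized output sentences (lists of tokens)."""
    total_tokens = sum(len(tokens) for tokens in translations)
    return total_tokens / elapsed_seconds

# Example with made-up numbers: 7 tokens generated in 0.005 s -> 1400 tok/s.
outputs = [["Guten", "Tag", "."], ["Hallo", "."], ["Danke", "."]]
print(target_tokens_per_second(outputs, 0.005))  # 1400.0
```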


GPU: GTX 1080
Beam size: 4

Batch size 32

| | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 1491 | 1032 | 910 |
| CTranslate2 | 3078 | 1448 | 1128 |
| CTranslate2 (int8) | 2595 | 1578 | 1200 |

(Not sure where the gap with the V100 benchmarks on OpenNMT-py comes from.)

Batch size 16

| | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 1004 | 706 | 671 |
| CTranslate2 | 2693 | 1378 | 1157 |
| CTranslate2 (int8) | 1992 | 1397 | 1114 |

Batch size 8

| | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 636 | 464 | 454 |
| CTranslate2 | 1915 | 1029 | 974 |
| CTranslate2 (int8) | 1339 | 980 | 840 |

Thanks for the numbers!

If you have time, could you also add numbers when using int8 (e.g. with `--compute_type int8` on the command line)? I found that you can start to see gains with big models.
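The `--compute_type int8` flag mentioned above is a CLI option; the sketch below assumes the same option is also exposed as the `compute_type` argument of CTranslate2's Python `Translator` class (the model path is a placeholder, and the import is guarded so the snippet runs even without the package installed).

```python
# Sketch (assumptions): compute_type="int8" on ctranslate2.Translator mirrors
# the --compute_type int8 CLI flag; "ende_ctranslate2/" is a placeholder path.
def translator_kwargs(quantized=True):
    """Build keyword arguments for constructing a CTranslate2 Translator."""
    return {
        "device": "cuda",
        "compute_type": "int8" if quantized else "default",
    }

try:
    import ctranslate2
    translator = ctranslate2.Translator("ende_ctranslate2/", **translator_kwargs())
except ImportError:
    # ctranslate2 is not installed here; the kwargs above still show the intent.
    pass
```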

> (Not sure where the gap with the V100 benchmarks on OpenNMT-py comes from.)

Following your observation, I checked again on a V100 instance with updated libraries: OpenNMT-py 1.0.1, PyTorch 1.4, CUDA 10.1, driver 440, etc. I got values close to what was originally reported (1037.7 vs 980.4 for a base Transformer). Overall, the GPU usage was quite low for this model (< 40%).

For reference, there are some new benchmark numbers here.

The previous results were not easily reproducible, but this time the script used for the benchmark is published and was executed on AWS instances.
