OpenNMT Forum

Some inference benchmarks

Here are a few benchmarks for Transformer inference with CTranslate2 vs OpenNMT-py. This is a first batch of results, and this post might be updated.
CTranslate2 inference was run with the CLI (ctranslate2/bin/translate).
OpenNMT-py inference was run with the onmt_translate entry point.

Speeds are in target tokens per second.
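
For anyone who wants to measure the same metric on their own setup, here is a minimal sketch of how target tokens per second could be computed. It is not the script used for the numbers in this post: `target_tokens_per_second`, the `translate` callable and the `clock` parameter are hypothetical names, and whether the original measurement excludes model loading or counts tokens exactly this way is an assumption.

```python
import time
from typing import Callable, List, Sequence


def target_tokens_per_second(
    translate: Callable[[Sequence[List[str]]], List[List[str]]],
    batches: Sequence[Sequence[List[str]]],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return generated target tokens divided by elapsed wall-clock seconds.

    `translate` is a hypothetical callable that maps a batch of tokenized
    source sentences to a list of tokenized hypotheses (one per sentence).
    """
    total_tokens = 0
    start = clock()
    for batch in batches:
        hypotheses = translate(batch)
        # Count output (target) tokens only, not source tokens.
        total_tokens += sum(len(tokens) for tokens in hypotheses)
    elapsed = clock() - start
    return total_tokens / elapsed
```

Wrapping the actual translation call of either toolkit in `translate` should give comparable numbers, provided the model is already loaded before timing starts.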

GPU: GTX 1080
Beam size: 4

Batch size 32

| System | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 1491 | 1032 | 910 |
| CTranslate2 | 3078 | 1448 | 1128 |
| CTranslate2 (int8) | 2595 | 1578 | 1200 |

(Not sure where the gap with the V100 benchmarks on OpenNMT-py comes from.)

Batch size 16

| System | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 1004 | 706 | 671 |
| CTranslate2 | 2693 | 1378 | 1157 |
| CTranslate2 (int8) | 1992 | 1397 | 1114 |

Batch size 8

| System | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 636 | 464 | 454 |
| CTranslate2 | 1915 | 1029 | 974 |
| CTranslate2 (int8) | 1339 | 980 | 840 |
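
To make the tables above easier to compare, here is a small sketch that turns them into ratios: CTranslate2 relative to OpenNMT-py, and int8 relative to the default CTranslate2 run. The figures are copied from the tables; the `RESULTS` layout and the `ratio` helper are just illustrative names.

```python
# Target tokens per second copied from the tables above (GTX 1080, beam size 4).
# Outer key: batch size. Inner keys: system, then model.
RESULTS = {
    32: {
        "OpenNMT-py (1.0.1)": {"base": 1491, "medium": 1032, "big": 910},
        "CTranslate2": {"base": 3078, "medium": 1448, "big": 1128},
        "CTranslate2 (int8)": {"base": 2595, "medium": 1578, "big": 1200},
    },
    16: {
        "OpenNMT-py (1.0.1)": {"base": 1004, "medium": 706, "big": 671},
        "CTranslate2": {"base": 2693, "medium": 1378, "big": 1157},
        "CTranslate2 (int8)": {"base": 1992, "medium": 1397, "big": 1114},
    },
    8: {
        "OpenNMT-py (1.0.1)": {"base": 636, "medium": 464, "big": 454},
        "CTranslate2": {"base": 1915, "medium": 1029, "big": 974},
        "CTranslate2 (int8)": {"base": 1339, "medium": 980, "big": 840},
    },
}


def ratio(batch_size, system, baseline, model):
    """Throughput of `system` divided by throughput of `baseline`."""
    row = RESULTS[batch_size]
    return row[system][model] / row[baseline][model]


for batch_size in sorted(RESULTS, reverse=True):
    for model in ("base", "medium", "big"):
        vs_py = ratio(batch_size, "CTranslate2", "OpenNMT-py (1.0.1)", model)
        int8_vs_default = ratio(batch_size, "CTranslate2 (int8)", "CTranslate2", model)
        print(
            f"batch {batch_size:>2} {model:<6} "
            f"CTranslate2/OpenNMT-py: {vs_py:.2f}x  int8/default: {int8_vs_default:.2f}x"
        )
```

In these numbers, int8 is ahead of the default run only for Medium and Big at batch size 32, and marginally for Medium at batch size 16; it is slower in the other cases.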

Thanks for the numbers!

If you have time, could you also add numbers when using int8 (e.g. with --compute_type int8 on the command line)? I found that you can start to see gains with big models.

(Not sure where the gap with the V100 benchmarks on OpenNMT-py comes from.)

Following your observation, I checked again on a V100 instance with updated libraries: OpenNMT-py 1.0.1, PyTorch 1.4, CUDA 10.1, driver 440, etc. I got values close to what was originally reported (1037.7 vs 980.4 for a base Transformer). Overall, the GPU usage was quite low for this model (< 40%).

For reference, there are some new benchmark numbers here:

The previous results were not easily reproducible, but this time the benchmark script is published and was run on AWS instances.
