Some inference benchmarks

Here are a few benchmarks for Transformer inference with CTranslate2 vs. OpenNMT-py. This is a first batch of results; this post might be updated.
Inference for CTranslate2 is run with the CLI interface (`ctranslate2/bin/translate`).
Inference for OpenNMT-py is run with the `onmt_translate` entry point.

Speeds are in target tokens per second.
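For clarity, here is a minimal sketch of how such a metric can be computed (my interpretation of "target tokens per second", not the exact benchmark script): the total number of tokens in the generated translations divided by the wall-clock time of the run.

```python
# Sketch: throughput as total generated target tokens / elapsed wall time.
def target_tokens_per_second(translations, elapsed_seconds):
    """translations: list of tokenized output sentences (lists of tokens)."""
    total_tokens = sum(len(tokens) for tokens in translations)
    return total_tokens / elapsed_seconds

# Example with made-up numbers: 7 tokens generated in 0.005 s -> 1400 tok/s.
outputs = [["Guten", "Tag", "."], ["Hallo", "."], ["Danke", "."]]
print(target_tokens_per_second(outputs, 0.005))  # 1400.0
```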


GPU: GTX 1080
Beam size: 4

Batch size 32

| | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 1491 | 1032 | 910 |
| CTranslate2 | 3078 | 1448 | 1128 |
| CTranslate2 (int8) | 2595 | 1578 | 1200 |

(Not sure where the gap with the V100 benchmarks on OpenNMT-py comes from.)

Batch size 16

| | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 1004 | 706 | 671 |
| CTranslate2 | 2693 | 1378 | 1157 |
| CTranslate2 (int8) | 1992 | 1397 | 1114 |

Batch size 8

| | Base (en-de) | Medium (en-es) | Big (en-fr) |
|---|---|---|---|
| OpenNMT-py (1.0.1) | 636 | 464 | 454 |
| CTranslate2 | 1915 | 1029 | 974 |
| CTranslate2 (int8) | 1339 | 980 | 840 |

Thanks for the numbers!

If you have time, could you also add numbers when using int8 (e.g. with `--compute_type int8` on the command line)? I found that you can start to see gains with big models.
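The `--compute_type int8` flag mentioned above is a CLI option; the sketch below assumes the same option is also exposed as the `compute_type` argument of CTranslate2's Python `Translator` class (the model path is a placeholder, and the import is guarded so the snippet runs even without the package installed).

```python
# Sketch (assumptions): compute_type="int8" on ctranslate2.Translator mirrors
# the --compute_type int8 CLI flag; "ende_ctranslate2/" is a placeholder path.
def translator_kwargs(quantized=True):
    """Build keyword arguments for constructing a CTranslate2 Translator."""
    return {
        "device": "cuda",
        "compute_type": "int8" if quantized else "default",
    }

try:
    import ctranslate2
    translator = ctranslate2.Translator("ende_ctranslate2/", **translator_kwargs())
except ImportError:
    # ctranslate2 is not installed here; the kwargs above still show the intent.
    pass
```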

> (Not sure where the gap with the V100 benchmarks on OpenNMT-py comes from.)

Following your observation, I checked again on a V100 instance with updated libraries: OpenNMT-py 1.0.1, PyTorch 1.4, CUDA 10.1, driver 440, etc. I got values close to what was originally reported (1037.7 vs 980.4 for a base Transformer). Overall, the GPU usage was quite low for this model (< 40%).

For reference, there are some new benchmark numbers here.

The previous results were not easily reproducible, but this time the script used for the benchmark is published and was executed on AWS instances.
