Hello Community
We are happy to release v3.4.3 with much faster beam search inference.
In essence, we have more than doubled the inference speed.
Some numbers:
v3.0.3 (Feb 2023)
| Batch size | tok/sec | Time (sec) | Memory |
|---|---|---|---|
| 32 | 2733 | 33.8 | 940M | 
| 64 | 4305 | 22.2 | 1.6G | 
| 128 | 6296 | 15.8 | 2.7G | 
| 256 | 8002 | 12.8 | 4.5G | 
| 512 | 8836 | 11.8 | 5.6G | 
| 960 | 8805 | 11.8 | 9.9G | 
v3.3.0 (June 2023)
| Batch size | tok/sec | Time (sec) | Memory |
|---|---|---|---|
| 32 | 2520 | 36.0 | 990M | 
| 64 | 3880 | 24.0 | 1.7G | 
| 128 | 5591 | 17.2 | 2.9G | 
| 256 | 7232 | 13.6 | 4.4G | 
| 512 | 7934 | 12.6 | 5.4G | 
| 960 | 7966 | 12.5 | 9.5G | 
v3.4.3 (Nov 2023)
| Batch size | tok/sec | Time (sec) | Memory |
|---|---|---|---|
| 32 | 5853 | 16.8 | 990M | 
| 64 | 10249 | 10.4 | 1.1G | 
| 128 | 15025 | 7.8 | 2.0G | 
| 256 | 18667 | 6.6 | 2.7G | 
| 512 | 20319 | 6.3 | 5.9G | 
| 960 | 21027 | 6.1 | 8.9G | 
All these numbers were obtained on an RTX 4090 with a vanilla EN-DE base Transformer.
The test set is the 3003 sentences of WMT14, translated with a beam_size of 4.
A few comments:
The reported tok/sec is measured inside the translator itself; it does not account for (see the sketch after this list):
- Python interpreter loading / termination (about 1.5 sec on my system)
- Model loading (about 0.4 sec on my system)

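To make that timing boundary concrete, here is a minimal sketch of how such a measurement can be taken. The `translate_file` function is a hypothetical placeholder for whatever translation entry point is being benchmarked, not the actual OpenNMT-py API; only the timing logic is the point.

```python
import time


def translate_file(src_path):
    """Hypothetical placeholder for the real translation call being benchmarked
    (e.g. the OpenNMT-py translator). Assumed to return the number of generated tokens."""
    # Dummy body so the sketch runs; replace with a real translation call.
    return 100_000


# Python interpreter startup and model loading happen before this point,
# so they are excluded from the reported tok/sec.
start = time.perf_counter()
num_tokens = translate_file("wmt14-test.en")
elapsed = time.perf_counter() - start

print(f"{num_tokens / elapsed:.0f} tok/sec over {elapsed:.1f} sec")
```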
I ran the same test with CT2:
with a batch size of 960 examples, it takes 2.3 sec. To be fair, we need to remove the Python loading/termination time (so 6.1 sec - 1.5 sec = 4.6 sec).
So OpenNMT-py is still about twice as slow as CT2 at this batch size, and about three times slower at batch size 32.
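For reference, a CT2 run along these lines is what I mean above. The model path, test-set path, and tokenization are assumptions you would adapt to your own converted EN-DE model; the `ctranslate2.Translator` / `translate_batch` calls themselves are the standard CTranslate2 Python API.

```python
import ctranslate2

# Assumed path to a CTranslate2-converted EN-DE model.
translator = ctranslate2.Translator("ende_ct2_model", device="cuda")

# Assumed pre-tokenized WMT14 test set, one sentence per line.
with open("wmt14-test.en.tok") as f:
    batch = [line.split() for line in f]

results = translator.translate_batch(
    batch,
    beam_size=4,          # same beam size as the OpenNMT-py runs
    max_batch_size=960,   # same batch size as the 2.3 sec run
    batch_type="examples",
)
print(results[0].hypotheses[0])
```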