OpenNMT-py v3.4.3 released - blazing fast beam search inference

Hello Community

We are happy to release v3.4.3 with very fast beam search inference.
In essence, we have more than doubled the inference speed.

Some numbers:
v3.0.3 (Feb 2023)

Batch size   tok/sec   Time (s)   Memory
32           2733      33.8       940M
64           4305      22.2       1.6G
128          6296      15.8       2.7G
256          8002      12.8       4.5G
512          8836      11.8       5.6G
960          8805      11.8       9.9G

v3.3.0 (June 2023)

Batch size   tok/sec   Time (s)   Memory
32           2520      36.0       990M
64           3880      24.0       1.7G
128          5591      17.2       2.9G
256          7232      13.6       4.4G
512          7934      12.6       5.4G
960          7966      12.5       9.5G

v3.4.3 (Nov 2023)

Batch size   tok/sec   Time (s)   Memory
32           5853      16.8       990M
64           10249     10.4       1.1G
128          15025     7.8        2.0G
256          18667     6.6        2.7G
512          20319     6.3        5.9G
960          21027     6.1        8.9G

All these numbers were obtained on an RTX 4090 for a vanilla EN-DE base transformer.
The test set is 3003 sentences from WMT14, using a beam_size of 4.
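For reference, a run like the ones above can be launched with the onmt_translate CLI; the sketch below is only illustrative (the checkpoint and data paths are placeholders, and the flag values mirror the settings described above):

```python
# Hedged sketch of launching one of the runs above from Python.
# The flag names are those of the onmt_translate CLI; the paths are placeholders.
import subprocess

subprocess.run([
    "onmt_translate",
    "-model", "ende_base_transformer.pt",  # placeholder EN-DE checkpoint
    "-src", "wmt14_test.src",              # the 3003 WMT14 source sentences
    "-output", "pred.txt",
    "-beam_size", "4",                     # beam size used in all tables
    "-batch_size", "960",                  # one of the batch sizes above
    "-gpu", "0",                           # single RTX 4090
], check=True)
```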
A few comments:

The reported tok/sec is measured inside the translator itself (see the timing sketch after this list); it does not account for:

  • Python interpreter loading / termination (about 1.5 sec on my system)
  • Model loading (0.4 sec on my system)
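In other words, the timer only wraps the translation itself, roughly along these lines (a minimal sketch; run_translator is a hypothetical stand-in, not the actual OpenNMT-py API):

```python
import time

# Minimal sketch: tok/sec = generated target tokens / time spent inside the
# translator. Interpreter startup/teardown and model loading are excluded
# because timing starts only after the model is already in memory.
start = time.time()
pred_tokens = run_translator(src_sentences)  # hypothetical helper, not the real API
elapsed = time.time() - start

tok_per_sec = sum(len(toks) for toks in pred_tokens) / elapsed
print(f"{tok_per_sec:.0f} tok/sec in {elapsed:.1f} s")
```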

I ran the same test with CT2:
with a batch size of 960 examples, it takes 2.3 sec. To be fair, we need to subtract the Python loading/termination from the OpenNMT-py time (6.1 sec - 1.5 sec = 4.6 sec).
So OpenNMT-py is still about twice as slow as CT2, and about three times slower at batch size 32.
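For anyone who wants to reproduce the CT2 side, the comparison run looks roughly like this (a sketch assuming an already converted CTranslate2 model directory and pre-tokenized input; the path and variable names are placeholders):

```python
import ctranslate2

# Sketch of a CTranslate2 run comparable to the tables above.
# "ende_ct2" is a placeholder for a converted EN-DE model directory;
# `source` is the list of pre-tokenized WMT14 sentences (list of token lists).
translator = ctranslate2.Translator("ende_ct2", device="cuda")

results = translator.translate_batch(
    source,
    beam_size=4,         # same beam size as the OpenNMT-py runs
    max_batch_size=960,  # batch in examples, matching the last table row
)
hypotheses = [r.hypotheses[0] for r in results]
```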


Many thanks, Vincent, for this update!

I have two questions, please:
1- What is the beam size used in these tests?
2- Any tips for making the inference that fast?

Thanks!
Yasmin

  1. I edited the post to be specific about the dataset and the beam size.

  2. A bunch of optimizations everywhere. I used the PyTorch profiler to analyze the bottlenecks; see the new “profile” option in translate.py. A minimal profiling sketch follows this list.
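If you want to run the same kind of analysis, a minimal torch.profiler sketch looks like this (translate_fn and batch are hypothetical placeholders for one decoding pass, not OpenNMT-py functions):

```python
from torch.profiler import profile, ProfilerActivity

# Profile one beam-search decode to see which operations dominate.
# `translate_fn` and `batch` are hypothetical placeholders.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    translate_fn(batch)

# Sort by GPU time to surface the bottlenecks.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```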

All of these changes landed between v3.4.1 and v3.4.3.

There could be more later.
