Hello Community
We are happy to release v3.4.3 with much faster beam search inference.
In essence, we have more than doubled the inference speed.
Some numbers:
v3.0.3 (Feb 2023)
| Batch size | tok/sec | Time (sec) | Memory |
|---|---|---|---|
| 32 | 2733 | 33.8 | 940M | 
| 64 | 4305 | 22.2 | 1.6G | 
| 128 | 6296 | 15.8 | 2.7G | 
| 256 | 8002 | 12.8 | 4.5G | 
| 512 | 8836 | 11.8 | 5.6G | 
| 960 | 8805 | 11.8 | 9.9G | 
v3.3.0 (June 2023)
| Batch size | tok/sec | Time (sec) | Memory |
|---|---|---|---|
| 32 | 2520 | 36.0 | 990M | 
| 64 | 3880 | 24.0 | 1.7G | 
| 128 | 5591 | 17.2 | 2.9G | 
| 256 | 7232 | 13.6 | 4.4G | 
| 512 | 7934 | 12.6 | 5.4G | 
| 960 | 7966 | 12.5 | 9.5G | 
v3.4.3 (Nov 2023)
| Batch size | tok/sec | Time (sec) | Memory |
|---|---|---|---|
| 32 | 5853 | 16.8 | 990M | 
| 64 | 10249 | 10.4 | 1.1G | 
| 128 | 15025 | 7.8 | 2.0G | 
| 256 | 18667 | 6.6 | 2.7G | 
| 512 | 20319 | 6.3 | 5.9G | 
| 960 | 21027 | 6.1 | 8.9G | 
All these numbers were obtained on an RTX 4090 with a vanilla EN-DE base Transformer.
The test set is the 3003 sentences of WMT14, translated with a beam_size of 4.
A few comments:
The reported tok/sec is measured inside the translator itself; it does not account for (see the sketch after this list):
- Python interpreter loading / termination (about 1.5 sec on my system)
- Model loading (about 0.4 sec on my system)

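To make that timing boundary concrete, here is a minimal sketch of how such a measurement can be taken. The `translate_file` function is a hypothetical placeholder for whatever translation entry point is being benchmarked, not the actual OpenNMT-py API; only the timing logic is the point.

```python
import time


def translate_file(src_path):
    """Hypothetical placeholder for the real translation call being benchmarked
    (e.g. the OpenNMT-py translator). Assumed to return the number of generated tokens."""
    # Dummy body so the sketch runs; replace with a real translation call.
    return 100_000


# Python interpreter startup and model loading happen before this point,
# so they are excluded from the reported tok/sec.
start = time.perf_counter()
num_tokens = translate_file("wmt14-test.en")
elapsed = time.perf_counter() - start

print(f"{num_tokens / elapsed:.0f} tok/sec over {elapsed:.1f} sec")
```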
I ran the same test with CT2:
with a batch size of 960 examples, it takes 2.3 sec. To be fair, we need to remove the Python loading/termination time (so 6.1 sec - 1.5 sec = 4.6 sec).
So OpenNMT-py is still about twice as slow as CT2 at this batch size, and about three times slower at batch size 32.
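For reference, a CT2 run along these lines is what I mean above. The model path, test-set path, and tokenization are assumptions you would adapt to your own converted EN-DE model; the `ctranslate2.Translator` / `translate_batch` calls themselves are the standard CTranslate2 Python API.

```python
import ctranslate2

# Assumed path to a CTranslate2-converted EN-DE model.
translator = ctranslate2.Translator("ende_ct2_model", device="cuda")

# Assumed pre-tokenized WMT14 test set, one sentence per line.
with open("wmt14-test.en.tok") as f:
    batch = [line.split() for line in f]

results = translator.translate_batch(
    batch,
    beam_size=4,          # same beam size as the OpenNMT-py runs
    max_batch_size=960,   # same batch size as the 2.3 sec run
    batch_type="examples",
)
print(results[0].hypotheses[0])
```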