Hello Community
We are happy to release v3.4.3, which brings much faster beam search inference.
In essence, we more than doubled the inference speed.
Some numbers:
v3.0.3 (Feb 2023)
Batch size | tok/sec | Time (s) | Memory |
---|---|---|---|
32 | 2733 | 33.8 | 940M |
64 | 4305 | 22.2 | 1.6G |
128 | 6296 | 15.8 | 2.7G |
256 | 8002 | 12.8 | 4.5G |
512 | 8836 | 11.8 | 5.6G |
960 | 8805 | 11.8 | 9.9G |
v3.3.0 (June 2023)
Batch size | tok/sec | Time (s) | Memory |
---|---|---|---|
32 | 2520 | 36.0 | 990M |
64 | 3880 | 24.0 | 1.7G |
128 | 5591 | 17.2 | 2.9G |
256 | 7232 | 13.6 | 4.4G |
512 | 7934 | 12.6 | 5.4G |
960 | 7966 | 12.5 | 9.5G |
v3.4.3 (Nov 2023)
Batch size | tok/sec | Time (s) | Memory |
---|---|---|---|
32 | 5853 | 16.8 | 990M |
64 | 10249 | 10.4 | 1.1G |
128 | 15025 | 7.8 | 2.0G |
256 | 18667 | 6.6 | 2.7G |
512 | 20319 | 6.3 | 5.9G |
960 | 21027 | 6.1 | 8.9G |
All these numbers were measured on an RTX 4090 with a vanilla EN-DE base Transformer.
The test set is 3003 sentences from WMT14, translated with a beam_size of 4.
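As a pointer for anyone who wants to reproduce this, a single run looks roughly like the sketch below, which calls the `onmt_translate` CLI from Python. The checkpoint and file names are hypothetical, and the option names are my assumption for the 3.x CLI; check `onmt_translate --help` on your install.

```python
import subprocess

# Hypothetical paths; option names assumed for OpenNMT-py 3.x.
subprocess.run(
    [
        "onmt_translate",
        "-model", "ende_base.pt",   # vanilla EN-DE base transformer checkpoint
        "-src", "wmt14.src",        # 3003 tokenized source sentences
        "-output", "pred.txt",
        "-gpu", "0",
        "-beam_size", "4",
        "-batch_size", "960",
        "-batch_type", "sents",     # assuming the batch sizes in the tables count sentences
    ],
    check=True,
)
```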
A few comments:
The reported tok/sec is measured inside the translator (see the timing sketch after this list), so it does not account for:
- Python interpreter loading / termination (about 1.5 sec on my system)
- Model loading (0.4 sec on my system)
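To make that boundary concrete, here is a minimal sketch of how such a tok/sec figure can be computed: the timer starts after the model is loaded and stops when translation returns, so interpreter startup and checkpoint loading are excluded. `load_model` and `translate` are placeholders for whatever API or CLI wrapper you use, and this sketch counts generated target tokens only.

```python
import time

def benchmark(load_model, translate, src_sentences):
    # Model loading happens outside the timed region, matching the
    # methodology above (interpreter startup is excluded as well).
    model = load_model()

    start = time.perf_counter()
    hypotheses = translate(model, src_sentences)  # beam search over the full test set
    elapsed = time.perf_counter() - start

    # tok/sec computed from the number of generated target tokens.
    n_tokens = sum(len(h.split()) for h in hypotheses)
    print(f"{n_tokens / elapsed:.0f} tok/sec, {elapsed:.1f} sec")
```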
I ran the same benchmark with CTranslate2 (CT2):
With a batch size of 960 examples, CT2 takes 2.3 sec. To be fair, we need to remove the Python loading/termination from the OpenNMT-py time (so 6.1 sec - 1.5 sec = 4.6 sec).
So OpenNMT-py is still about twice as slow as CT2 at batch size 960, and about 3 times slower at batch size 32.
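For completeness, the CT2 side of the comparison can be driven from the CTranslate2 Python API roughly as below. The model directory and the pre-tokenized input file are assumptions (the checkpoint has to be converted to CTranslate2 format first); `Translator`, `translate_batch`, `beam_size` and `max_batch_size` are part of CTranslate2's public API.

```python
import ctranslate2

# Assumed paths: a converted CTranslate2 model and a pre-tokenized source file.
translator = ctranslate2.Translator("ende_ct2", device="cuda")
src_tokens = [line.split() for line in open("wmt14.src")]

results = translator.translate_batch(
    src_tokens,
    beam_size=4,          # same beam size as the OpenNMT-py runs
    max_batch_size=960,   # batch size in examples
    batch_type="examples",
)
best = [" ".join(r.hypotheses[0]) for r in results]
```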