Explanation of performance increase from baseline CTranslate2 model?

mwiethoff · December 3, 2020, 5:20pm

I’m wondering what accounts for the performance improvement between the OpenNMT-py/tf implementations and the baseline CTranslate2 model.

Looking at the benchmarks listed, the baseline model is significantly faster (537.8 tokens per second vs 292.4 tokens per second). I’ve seen similar numbers benchmarking against frameworks like fairseq.

This may be a bit of an oversimplification, but it seems like there are roughly two components to each system: a computation graph and the code that wraps it to orchestrate the beam search.

I would have assumed the computation graph has comparable performance in both systems, because they’re both using C++ under the hood and have similar operations. It’s also not clear to me why the Python wrapping code would be a bottleneck, since I would have assumed that the vast majority of the computation comes from evaluating the model.

Clearly I’m missing something. Would it be possible to get an idea of which components / optimizations are contributing these gains?

guillaumekln · December 4, 2020, 9:03am

Hi,

There are multiple reasons (non exhaustive list):

- Many operations in CTranslate2 are backed by Intel MKL, which offers unmatched performance on Intel CPUs. PyTorch and TensorFlow may also use MKL, but not to the same extent as CTranslate2. For example, we use MKL for vector math (vector addition, multiplication, etc.), which is typically not the case in PyTorch or TensorFlow.
- For batches with variable-length sequences, CTranslate2 is able to ignore padding positions in several layers, which reduces the number of operations.
- We only support specific models, so we can apply optimizations that general-purpose frameworks can’t afford:
  - Reuse memory buffers to avoid reallocation/copies
  - Apply some transformations in-place
  - Skip many input checks within operations
  - Optimize for specific shapes and dimensions
- Some operations are fused together, e.g. masked softmax, queries/keys/values projection in multi-head attention, etc.
- There are some tricks in CTranslate2’s beam search implementation to minimize reordering and copying of the decoder state.
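
To make the padding point above concrete, here is a toy sketch in plain Python. It is not CTranslate2 code: the helper names (`apply_dense`, `apply_skip_padding`, `CountingFn`) and the data are hypothetical, and a real implementation would work on packed tensors rather than nested lists. It only illustrates why skipping padded positions reduces the number of operations in a position-wise layer.

```python
# Toy sketch: names and data are hypothetical, not CTranslate2 code.

def apply_dense(batch, fn):
    """Apply fn at every position, padding included."""
    return [[fn(x) for x in row] for row in batch]


def apply_skip_padding(batch, lengths, fn, pad_value=0.0):
    """Apply fn only to the first lengths[i] positions of each row."""
    out = []
    for row, n in zip(batch, lengths):
        out.append([fn(x) for x in row[:n]] + [pad_value] * (len(row) - n))
    return out


class CountingFn:
    """Wrap a function and count how many times it is called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)


# Three sequences padded with zeros to a common length of 5.
batch = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 2.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
]
lengths = [5, 2, 1]

dense_fn = CountingFn(lambda x: x * 2.0)
sparse_fn = CountingFn(lambda x: x * 2.0)

dense_out = apply_dense(batch, dense_fn)
sparse_out = apply_skip_padding(batch, lengths, sparse_fn)

print(dense_fn.calls, sparse_fn.calls)  # 15 8
```

In this toy batch, almost half of the work in the dense version is spent on padding. The saving grows as sequence lengths within a batch become more uneven.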

In OpenNMT-py, beam search involves a lot of Python operations. I don’t know exactly how much it costs, but it surely has a non-negligible impact.

In OpenNMT-tf, the whole beam search logic is actually encoded in a graph. However, it is still slower mostly because it does not implement the removal of finished translations from the batch which is done by both OpenNMT-py and CTranslate2.
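
The effect of removing finished translations from the batch can be sketched with a small counting model. This is a hypothetical illustration, not code from OpenNMT-py, OpenNMT-tf, or CTranslate2: `decode_evaluations` and `finish_steps` are made-up names, and each "evaluation" stands for running the decoder on one sequence for one step.

```python
# Toy sketch: counts decoder evaluations, does not run a real model.

def decode_evaluations(finish_steps, remove_finished):
    """Count per-sequence decoder evaluations for a toy batch.

    finish_steps[i] is the (hypothetical) step at which sequence i
    emits its end token. Decoding stops when the longest one is done.
    """
    batch_size = len(finish_steps)
    active = list(range(batch_size))
    evaluations = 0
    step = 0
    while step < max(finish_steps):
        step += 1
        if remove_finished:
            # Only sequences that are still running are evaluated.
            evaluations += len(active)
            active = [i for i in active if finish_steps[i] > step]
        else:
            # Finished sequences stay in the batch and are still computed.
            evaluations += batch_size
    return evaluations


finish_steps = [3, 5, 10]
print(decode_evaluations(finish_steps, remove_finished=False))  # 30
print(decode_evaluations(finish_steps, remove_finished=True))   # 18
```

Without removal, the cost is the batch size times the longest output. With removal, it is the sum of the individual output lengths, so one long translation no longer forces extra work for the short ones.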

Note that we only mentioned beam search here, which is typically a hard task for PyTorch or TensorFlow. When using greedy search the performance gain is usually smaller and you should use quantization to go further.
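
For readers unfamiliar with quantization, the general idea can be sketched as follows. This is generic symmetric 8-bit quantization written in plain Python as an assumption for illustration; the thread does not describe the exact scheme CTranslate2 uses, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
# Toy sketch of symmetric int8 quantization; not CTranslate2's implementation.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w * scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the quantized integers."""
    return [q / scale for q in quantized]


weights = [0.4, -1.0, 0.25]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Each restored value is within half a quantization step of the original.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error <= 0.5 / scale + 1e-12)
```

The integers take a quarter of the memory of 32-bit floats and allow integer matrix multiplications, at the cost of a small, bounded rounding error per weight.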

mwiethoff · December 8, 2020, 6:45pm

Thanks for the thorough response!

I also noticed your comments around possibly supporting ONNX in the future.

Do you have a sense of how many of these TensorFlow / PyTorch performance issues also apply to ONNX? More generally, do you have an idea of the overall performance hit (if there is one)?

guillaumekln · December 9, 2020, 4:22pm

I don’t have much experience with ONNX runtimes, but they should have many similarities with the TensorFlow runtime.

Most of the performance gain of CTranslate2 is about being less generic and more specific to Transformer models and text generation. So the same model executed with ONNX should be slower than CTranslate2.

The idea behind ONNX support would first be to support more model architectures. Do you have a specific architecture in mind?