I’m wondering what accounts for the performance gap between the OpenNMT-py/tf implementations and the baseline CTranslate2 model.
Looking at the benchmarks listed, the baseline CTranslate2 model is significantly faster (537.8 vs 292.4 tokens per second). I’ve seen similar numbers when benchmarking against frameworks like fairseq.
This may be a bit of an oversimplification, but it seems like there are roughly two components to each system: a computation graph and the code that wraps it to orchestrate the beam search.
I would have assumed the computation graph has comparable performance in both systems, since both are ultimately running C++ kernels under the hood and performing similar operations. It’s also not clear to me why the Python wrapping code would be a bottleneck, since I would have expected the vast majority of the time to be spent evaluating the model.
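To make that split concrete, here is a rough sketch of what I mean by the "wrapper" layer: a Python loop doing the beam bookkeeping while each step calls into the compiled graph. This is hypothetical code, not taken from OpenNMT-py; `model.encode`, `model.decode_step`, `model.bos_id`, and `model.eos_id` are placeholders I made up for illustration.

```python
import torch

def beam_search_sketch(model, src_tokens, beam_size=5, max_len=100):
    """Hypothetical illustration of the 'wrapper' layer:
    Python-level beam bookkeeping driving repeated calls
    into the framework's compiled kernels."""
    # Encode the source once: this part runs inside C++ kernels.
    memory = model.encode(src_tokens)

    # Beam bookkeeping below lives in Python and runs once per step.
    beams = [([model.bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # One decoder call per step: again mostly C++ under the hood.
            log_probs = model.decode_step(torch.tensor([tokens]), memory)
            topk = torch.topk(log_probs[0, -1], beam_size)
            for lp, tok in zip(topk.values.tolist(), topk.indices.tolist()):
                candidates.append((tokens + [tok], score + lp))
        # Pruning, scoring, and EOS handling all happen here in Python.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t[-1] == model.eos_id for t, _ in beams):
            break
    return beams[0][0]
```

My assumption was that the decoder calls inside the loop dominate the runtime, and the Python bookkeeping around them is negligible.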
Clearly I’m missing something. Would it be possible to get an idea of which components / optimizations are contributing to these gains?