Low inference performance with the Transformer model on a single GPU


I just trained a model on Sanskrit using the transformer_1gpu settings. Training went well and the output quality is very good, so we are happy with the result so far. However, inference performance is very low: it takes about one hour to translate 4 MB of text. Our previous CNN-based model, which took significantly longer to train and has lower overall accuracy, translates 4 MB in less than two minutes. During inference the GPU is certainly used, and batch_size is as high as possible (higher settings result in OOM). Any ideas about what is going wrong here?


What is the batch size you use?

If that is possible in your use case, consider sorting the file to translate based on the number of tokens. You’ll get much better performance, especially with large batch sizes.

We might do this transparently for file-based inference in the future.

Thank you very much for your reply. The batch size is 32, and the GPU is a GTX Titan X with 12 GB.
By “sorting based on the number of tokens”, do you mean sorting from the smallest number to the largest?
That should be possible; it’s no problem to write a script that restores the original order afterwards.
Also, do you know whether the same performance is to be expected when implementing this in Tensor2Tensor directly? I am right now wondering whether it would be a good idea to switch code bases in case it gives us some improvement.

Yes, or the opposite. The goal is to make sequences of similar lengths next to each other.
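
This reordering trick can be sketched in a few lines of Python (a minimal example, assuming whitespace tokenization and one sentence per line; file I/O and the actual translation step are left out, and the function names are illustrative, not part of any toolkit):

```python
def sort_by_length(lines):
    """Return (sorted_lines, order) where order[i] is the original
    index of the i-th sorted line. Sorting by token count groups
    sequences of similar length, which reduces padding per batch."""
    order = sorted(range(len(lines)), key=lambda i: len(lines[i].split()))
    return [lines[i] for i in order], order

def restore_order(translated_lines, order):
    """Invert the permutation from sort_by_length to put the
    (translated) lines back in their original order."""
    restored = [None] * len(translated_lines)
    for pos, original_index in enumerate(order):
        restored[original_index] = translated_lines[pos]
    return restored

if __name__ == "__main__":
    lines = ["a b c d", "a", "a b"]
    sorted_lines, order = sort_by_length(lines)
    # Translate sorted_lines here; restoring the order afterwards
    # maps each output back to its source position.
    assert restore_order(sorted_lines, order) == lines
```

You would run the translation on the sorted file and then apply `restore_order` to the output, keeping the permutation around in between.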

Tensor2Tensor is the reference implementation, so it should be slightly better in accuracy and speed (after applying this reordering trick). If the T2T workflow works for you, you should of course use it. The Transformer is just one of the models available in OpenNMT.

Thank you very much!