I am trying to reduce the inference time of an OpenNMT model. I trained the model with mixed precision using APEX and want to run inference in FP16 so that Tensor Cores are used. When I tried this, inference was only about 10% faster than with the FP32 model. Has anyone tried this and seen the same behavior? Should I be seeing a larger improvement? If so, what part of the code should be modified so that Tensor Cores are actually used? Right now, judging from the model summary, the shapes of most of the convolutional layers are multiples of 8, so those layers should be using Tensor Cores. I am concerned about the encoder and decoder RNNs.
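A minimal sketch of the shape check described above, assuming the usual guidance that FP16 matrix multiplies map onto Tensor Cores most readily when the matrix dimensions are multiples of 8. The helper name `find_misaligned` and the example parameter names and shapes are hypothetical, not taken from the actual model. It only looks at 2-D weight matrices (linear and RNN weights) and does not check the batch or sequence-length dimensions, which may also matter.

```python
def find_misaligned(named_shapes, multiple=8):
    """Return (name, shape) pairs for 2-D weights with a dimension
    that is not a multiple of `multiple`."""
    misaligned = []
    for name, shape in named_shapes:
        if len(shape) != 2:
            continue  # skip biases and non-matrix parameters
        if any(dim % multiple != 0 for dim in shape):
            misaligned.append((name, tuple(shape)))
    return misaligned


# Hypothetical shapes for illustration only. With a PyTorch model,
# the pairs could instead be built from model.named_parameters(), e.g.
#   ((n, tuple(p.shape)) for n, p in model.named_parameters())
example = [
    ("encoder.rnn.weight_ih_l0", (2048, 512)),
    ("decoder.rnn.weight_hh_l0", (2048, 512)),
    ("generator.weight", (50004, 512)),  # vocabulary size not a multiple of 8
    ("generator.bias", (50004,)),
]

for name, shape in find_misaligned(example):
    print(f"{name}: {shape}")
```

In this made-up example only the output projection is flagged, because its vocabulary dimension is not a multiple of 8. Padding the vocabulary to the next multiple of 8 is one commonly suggested remedy, though whether it helps here is untested.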
Apologies if this question does not belong on this forum.