Could you please help me find out why the Torch version is so much slower than the PyTorch version? More specifically, when I run both versions on a GeForce GTX 970 with the default configuration, the PyTorch version processes nearly 3000 tokens per second, while the Torch version processes only about 300 tokens per second. I have installed CUDA 8 and cuDNN 6. Is there anything I am missing? Many thanks in advance!