Is CTranslate faster than the Lua version?

I would like to know whether there is any experiment comparing the Lua version and the C++ version. Does the C++ version run faster than Torch?

Here are the results of the benchmark I just ran on my desktop Intel i7. It compares tokens per second (higher is better) on the English-German model, using the first 100 sentences of data/src-test.txt.

| Threads | Batch size | Beam size | Lua (GPU*) | Lua (CPU) | Lua (CPU MKL) | CTranslate |
|---------|------------|-----------|------------|-----------|---------------|------------|
| 1 | 30 | 5 | 646.8 | 66.9 | 76.6 | 69.1 |
| 1 | 30 | 1 | 535.1 | 103.6 | 151.4 | 202.1 |
| 1 | 1 | 5 | 209.0 | 22.7 | 38.1 | 33.7 |
| 1 | 1 | 1 | 166.9 | 22.7 | 54.7 | 43.2 |
| 2 | 30 | 5 | 646.8 | 90.6 | 106.3 | 96.3 |
| 2 | 30 | 1 | 535.1 | 125.4 | 213.4 | 310.8 |
| 2 | 1 | 5 | 209.0 | 24.0 | 52.3 | 51.2 |
| 2 | 1 | 1 | 166.9 | 23.9 | 66.1 | 70.0 |
| 4 | 30 | 5 | 646.8 | 104.0 | 128.7 | 116.2 |
| 4 | 30 | 1 | 535.1 | 128.5 | 207.0 | 392.7 |
| 4 | 1 | 5 | 209.0 | 24.1 | 52.8 | 62.2 |
| 4 | 1 | 1 | 166.9 | 23.3 | 61.1 | 84.9 |

Because matrix operations take most of the time, this mostly compares Torch+OpenBLAS against Eigen. Note that using Torch with Intel® MKL may change the results completely: previous experiments showed that it is faster with multiple threads.

* GTX 1080
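For context on the metric: assuming "tokens per second" here simply means the total number of tokens processed divided by wall-clock translation time, a minimal sketch of the computation could look like the following. The `translate_fn` callable is a hypothetical stand-in for whichever decoder is being timed, not an actual API of translate.lua or CTranslate.

```python
import time

def tokens_per_second(translate_fn, sentences):
    """Time a translation run and report throughput in tokens per second.

    `translate_fn` is a hypothetical callable wrapping the decoder under
    test; it takes a list of tokenized sentences and returns a list of
    token lists (the translations).
    """
    start = time.perf_counter()
    outputs = translate_fn(sentences)
    elapsed = time.perf_counter() - start
    total_tokens = sum(len(tokens) for tokens in outputs)
    return total_tokens / elapsed
```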

Thanks very much. It seems that CTranslate does not speed things up much without multithreading.

Are these results CPU- or GPU-based?

CTranslate only supports CPU translation for now.

Would it be possible to add a GPU translate.lua column to the above table?

Here you go. I simply copy-pasted the same GPU numbers for each thread count, as threading works differently on the GPU.

Cool. What GPU was that run on?

I’m seeing about an order of magnitude difference in the secs/sent reported on my laptop, which has an i7-6820HK @ 2.70GHz (8 logical cores) and a single GTX 1080.

Oh, I forgot to mention: it is a GTX 1080.

Have you (or anyone, for that matter) done any benchmarks on AWS? I’d be very curious to see EC2 performance numbers… Maybe comparing a c4.2xlarge to a p2.xlarge. Amazon uses K80 cards in their p-series instances, and the c4 has Xeon E5-2666 Haswell CPUs.

On-demand pricing for the p2.xlarge is more than double that of the c4.2xlarge, so it would be interesting to know, if you had a very large volume of data to translate, where the breakeven point would be…
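As a back-of-the-envelope sketch of that cost question, here is one way to frame it. Every number below is an assumed placeholder, to be replaced with real EC2 pricing and measured throughput; none of it is quoted or measured data.

```python
# Rough cost comparison for CPU (c4.2xlarge) vs GPU (p2.xlarge) translation.
# All numbers are assumed placeholders, not quoted prices or measured speeds.
CPU_PRICE_PER_HOUR = 0.40    # assumed c4.2xlarge on-demand price, USD/hour
GPU_PRICE_PER_HOUR = 0.90    # assumed p2.xlarge on-demand price, USD/hour
CPU_TOKENS_PER_SEC = 200.0   # assumed CPU throughput
GPU_TOKENS_PER_SEC = 900.0   # assumed GPU throughput

def cost_per_million_tokens(price_per_hour, tokens_per_sec):
    """Dollar cost to translate one million tokens at a given throughput."""
    seconds = 1_000_000 / tokens_per_sec
    return price_per_hour * seconds / 3600.0

cpu = cost_per_million_tokens(CPU_PRICE_PER_HOUR, CPU_TOKENS_PER_SEC)
gpu = cost_per_million_tokens(GPU_PRICE_PER_HOUR, GPU_TOKENS_PER_SEC)
print(f"CPU: ${cpu:.2f} per million tokens, GPU: ${gpu:.2f} per million tokens")
```

With purely hourly pricing the comparison reduces to price divided by throughput, so the cheaper option is the same at any volume; a true breakeven volume only appears once fixed overheads (instance start-up, minimum billing increments, model loading) are factored in.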

I have new numbers on a larger model (bidirectional LSTM 4x800, closer to “real-world” models). Again, I report tokens per second (higher is better).

  • the CPU is a desktop Intel® i7 (4 physical cores)
  • the GPU is a GTX 1080
  • Torch is compiled with Intel® MKL
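For readers unfamiliar with the shorthand, "bidirectional LSTM 4x800" refers to four stacked LSTM layers with 800-unit hidden states run in both directions. The PyTorch sketch below is only meant to convey the rough size of such an encoder; the embedding dimension is a placeholder, and the actual OpenNMT-lua model differs in how the two directions are sized and merged (and of course also includes attention and a decoder).

```python
import torch
import torch.nn as nn

# Rough size sketch only: 4 stacked bidirectional LSTM layers with
# 800-unit hidden states, over 500-dimensional embeddings (placeholder).
encoder = nn.LSTM(input_size=500, hidden_size=800,
                  num_layers=4, bidirectional=True, batch_first=True)

dummy_batch = torch.randn(32, 50, 500)   # (batch, sentence length, emb dim)
outputs, (h_n, c_n) = encoder(dummy_batch)
print(outputs.shape)                      # torch.Size([32, 50, 1600])
```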

Beam 1

| Threads | Batch size | CTranslate (CPU) | translate.lua (CPU) | translate.lua (GPU) |
|---------|------------|------------------|---------------------|---------------------|
| 1 | 1 | 19.1 | 31.5 | |
| 1 | 4 | 36.9 | 43.6 | |
| 1 | 16 | 81.9 | 79.1 | |
| 1 | 32 | 101.1 | 96.3 | |
| 1 | 64 | 116.8 | 102.1 | |
| 1 | 128 | 123.1 | 105.8 | |
| 2 | 1 | 28.2 | 37.7 | |
| 2 | 4 | 57.4 | 56.4 | |
| 2 | 16 | 125.8 | 110.7 | |
| 2 | 32 | 156.6 | 131.4 | |
| 2 | 64 | 176.4 | 140.8 | |
| 2 | 128 | 179.9 | 148.5 | |
| 4 | 1 | 37.5 | 34.8 | |
| 4 | 4 | 77.1 | 65.5 | |
| 4 | 16 | 157.2 | 122.9 | |
| 4 | 32 | 200.2 | 156.7 | |
| 4 | 64 | 220.6 | 169.0 | |
| 4 | 128 | 220.9 | 183.5 | |
| 8 | 1 | 37.8 | 32.0 | 137.2 |
| 8 | 4 | 79.6 | 58.6 | 269.1 |
| 8 | 16 | 161.4 | 111.7 | 523.4 |
| 8 | 32 | 202.0 | 148.2 | 708.8 |
| 8 | 64 | 219.9 | 154.4 | 874.4 |
| 8 | 128 | 219.4 | 159.9 | 1024.6 |

Beam 5

| Threads | Batch size | CTranslate (CPU) | translate.lua (CPU) | translate.lua (GPU) |
|---------|------------|------------------|---------------------|---------------------|
| 1 | 1 | 13.9 | 17.4 | |
| 1 | 4 | 25.3 | 27.5 | |
| 1 | 16 | 42.6 | 37.7 | |
| 1 | 32 | 45.9 | 41.8 | |
| 1 | 64 | 46.8 | 42.5 | |
| 1 | 128 | 36.1 | 39.0 | |
| 2 | 1 | 22.3 | 23.0 | |
| 2 | 4 | 40.0 | 39.6 | |
| 2 | 16 | 63.5 | 55.5 | |
| 2 | 32 | 66.7 | 61.6 | |
| 2 | 64 | 63.9 | 61.9 | |
| 2 | 128 | 60.6 | 55.3 | |
| 4 | 1 | 29.6 | 26.5 | |
| 4 | 4 | 54.3 | 47.1 | |
| 4 | 16 | 77.5 | 69.5 | |
| 4 | 32 | 81.5 | 76.9 | |
| 4 | 64 | 74.6 | 76.9 | |
| 4 | 128 | 70.0 | 69.4 | |
| 8 | 1 | 31.3 | 24.5 | 121.4 |
| 8 | 4 | 54.7 | 43.0 | 207.2 |
| 8 | 16 | 77.9 | 63.6 | 341.5 |
| 8 | 32 | 80.7 | 67.7 | 411.1 |
| 8 | 64 | 75.9 | 67.5 | 461.7 |
| 8 | 128 | 71.9 | 61.3 | 491.5 |

Hi,
just in case anybody is surprised that 8 threads bring no improvement over 4 threads: this is simply a limitation of the CPU, which has only 4 physical cores. Intel Hyper-Threading gives no speedup when all threads are doing floating-point work, because the two hardware threads on a core share its floating-point units.
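A practical consequence is that the decoder's thread count is best tied to the number of physical cores rather than logical ones. Here is a small sketch of how to query both; psutil is not part of this thread, just a common way to get the physical count.

```python
import os

import psutil  # third-party package; exposes physical-core counts

logical_cores = os.cpu_count()                    # counts hyper-threads
physical_cores = psutil.cpu_count(logical=False)  # real cores only, may be None

# For FP-heavy decoding, more threads than physical cores rarely helps.
num_threads = physical_cores or logical_cores
print(f"logical={logical_cores}, physical={physical_cores}, using {num_threads}")
```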

Thank you for those numbers, very interesting!

I will reference a related GitHub issue here.

tl;dr:
I am constantly running out of memory on a GTX 1060, even with batch_size = 1 and beam_size = 4, using the en-de model.
Does anyone have any tips?

This thread is not related to OpenNMT-py. The issue you linked is the right place to discuss this specific problem.