I want to know: is there any experiment comparing the Lua version and the C++ version? Does the C++ version run faster than Torch?
Here are the results of the benchmark I just ran on my desktop Intel i7. It compares tokens per second (higher is better) on the English-German model and the first 100 sentences of data/src-test.txt.
Threads | Batch size | Beam size | Lua (GPU*) | Lua (CPU) | Lua (CPU MKL) | CTranslate |
---|---|---|---|---|---|---|
1 | 30 | 5 | 646.8 | 66.9 | 76.6 | 69.1 |
1 | 30 | 1 | 535.1 | 103.6 | 151.4 | 202.1 |
1 | 1 | 5 | 209.0 | 22.7 | 38.1 | 33.7 |
1 | 1 | 1 | 166.9 | 22.7 | 54.7 | 43.2 |
2 | 30 | 5 | 646.8 | 90.6 | 106.3 | 96.3 |
2 | 30 | 1 | 535.1 | 125.4 | 213.4 | 310.8 |
2 | 1 | 5 | 209.0 | 24.0 | 52.3 | 51.2 |
2 | 1 | 1 | 166.9 | 23.9 | 66.1 | 70.0 |
4 | 30 | 5 | 646.8 | 104.0 | 128.7 | 116.2 |
4 | 30 | 1 | 535.1 | 128.5 | 207.0 | 392.7 |
4 | 1 | 5 | 209.0 | 24.1 | 52.8 | 62.2 |
4 | 1 | 1 | 166.9 | 23.3 | 61.1 | 84.9 |
Because matrix operations take most of the time, this mostly compares Torch+OpenBLAS against Eigen. Note that using Torch with Intel® MKL may completely change the results: previous experiments showed that it is faster with multiple threads.
* GTX 1080
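For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a timing harness, in Python for convenience. The command line shown is a placeholder based on translate.lua's usual flags, not the exact invocation used for the table above; substitute your own model, input file, and options.

```python
import subprocess
import time

def tokens_per_second(cmd, output_path):
    """Time a translation command and report throughput.

    `cmd` is a placeholder command line; substitute the real
    translate.lua or CTranslate invocation and its flags.
    """
    start = time.time()
    subprocess.run(cmd, check=True)
    elapsed = time.time() - start

    # Count whitespace-separated tokens in the translated output.
    with open(output_path) as f:
        n_tokens = sum(len(line.split()) for line in f)

    return n_tokens / elapsed

# Example (hypothetical model and file names):
tps = tokens_per_second(
    ["th", "translate.lua",
     "-model", "model.t7", "-src", "src-test-100.txt",
     "-output", "out.txt", "-batch_size", "30", "-beam_size", "5"],
    "out.txt",
)
print(f"{tps:.1f} tokens/s")
```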
Thanks very much! It seems that CTranslate doesn't speed things up much without multithreading.
Are these results CPU or GPU based?
CTranslate only supports CPU translation for now.
Would it be possible to add a GPU translate.lua column to the above table?
Here you go. I simply copy-pasted the GPU numbers for each thread count, as threading works differently for the GPU.
Cool. What GPU was that run on?
I’m seeing about an order of magnitude difference in the secs/sent reported on my laptop, which has an i7-6820HK 2.70GHz (x8) processor and a single GTX 1080.
Oh forgot to mention, it is a GTX 1080.
Have you (or anyone, for that matter) done any benchmarks on AWS? I’d be very curious to see EC2 performance numbers… Maybe comparing a c4.2xlarge to a p2.xlarge. Amazon uses K80 cards in their p-series instances, and the c4 has Xeon E5-2666 Haswell CPUs.
On-demand compute time pricing for the p2.xlarge is more than double the c4.2xlarge, so if you had a very large volume of data to translate, it would be interesting to know where the break-even point is…
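As a side note, with on-demand pricing there is no per-job fixed cost, so the comparison reduces to cost per translated token: whichever instance is cheaper per token wins at any volume, and there is no volume break-even unless you add fixed overheads (spin-up time, storage, etc.). Here is a small sketch of the arithmetic, with illustrative prices and throughputs (assumptions, not measured EC2 numbers):

```python
# Sketch of the cost comparison. All numbers below are illustrative
# assumptions, not measured EC2 prices or throughputs.
CPU_PRICE_PER_HOUR = 0.398   # hypothetical c4.2xlarge on-demand price ($/h)
GPU_PRICE_PER_HOUR = 0.900   # hypothetical p2.xlarge on-demand price ($/h)
CPU_TOKENS_PER_SEC = 220.0   # hypothetical CPU throughput
GPU_TOKENS_PER_SEC = 1000.0  # hypothetical GPU throughput

def cost_per_million_tokens(price_per_hour, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1e6

cpu_cost = cost_per_million_tokens(CPU_PRICE_PER_HOUR, CPU_TOKENS_PER_SEC)
gpu_cost = cost_per_million_tokens(GPU_PRICE_PER_HOUR, GPU_TOKENS_PER_SEC)
print(f"CPU: ${cpu_cost:.2f} per million tokens")
print(f"GPU: ${gpu_cost:.2f} per million tokens")

# With no per-job fixed cost, the GPU instance is cheaper at any volume
# as soon as its throughput advantage exceeds its price ratio:
print("GPU cheaper per token:",
      GPU_TOKENS_PER_SEC / CPU_TOKENS_PER_SEC
      > GPU_PRICE_PER_HOUR / CPU_PRICE_PER_HOUR)
```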
I have new numbers on a larger model (bidirectional LSTM 4x800, closer to “real-world” models). Again, I report tokens per second (higher is better).
- the CPU is a desktop Intel® i7 (4 physical cores)
- the GPU is a GTX 1080
- Torch is compiled with Intel® MKL
Beam 1
Threads | Batch size | CTranslate (CPU) | translate.lua (CPU) | translate.lua (GPU) |
---|---|---|---|---|
1 | 1 | 19.1 | 31.5 | |
1 | 4 | 36.9 | 43.6 | |
1 | 16 | 81.9 | 79.1 | |
1 | 32 | 101.1 | 96.3 | |
1 | 64 | 116.8 | 102.1 | |
1 | 128 | 123.1 | 105.8 | |
2 | 1 | 28.2 | 37.7 | |
2 | 4 | 57.4 | 56.4 | |
2 | 16 | 125.8 | 110.7 | |
2 | 32 | 156.6 | 131.4 | |
2 | 64 | 176.4 | 140.8 | |
2 | 128 | 179.9 | 148.5 | |
4 | 1 | 37.5 | 34.8 | |
4 | 4 | 77.1 | 65.5 | |
4 | 16 | 157.2 | 122.9 | |
4 | 32 | 200.2 | 156.7 | |
4 | 64 | 220.6 | 169.0 | |
4 | 128 | 220.9 | 183.5 | |
8 | 1 | 37.8 | 32.0 | 137.2 |
8 | 4 | 79.6 | 58.6 | 269.1 |
8 | 16 | 161.4 | 111.7 | 523.4 |
8 | 32 | 202.0 | 148.2 | 708.8 |
8 | 64 | 219.9 | 154.4 | 874.4 |
8 | 128 | 219.4 | 159.9 | 1024.6 |
Beam 5
Threads | Batch size | CTranslate (CPU) | translate.lua (CPU) | translate.lua (GPU) |
---|---|---|---|---|
1 | 1 | 13.9 | 17.4 | |
1 | 4 | 25.3 | 27.5 | |
1 | 16 | 42.6 | 37.7 | |
1 | 32 | 45.9 | 41.8 | |
1 | 64 | 46.8 | 42.5 | |
1 | 128 | 36.1 | 39.0 | |
2 | 1 | 22.3 | 23.0 | |
2 | 4 | 40.0 | 39.6 | |
2 | 16 | 63.5 | 55.5 | |
2 | 32 | 66.7 | 61.6 | |
2 | 64 | 63.9 | 61.9 | |
2 | 128 | 60.6 | 55.3 | |
4 | 1 | 29.6 | 26.5 | |
4 | 4 | 54.3 | 47.1 | |
4 | 16 | 77.5 | 69.5 | |
4 | 32 | 81.5 | 76.9 | |
4 | 64 | 74.6 | 76.9 | |
4 | 128 | 70.0 | 69.4 | |
8 | 1 | 31.3 | 24.5 | 121.4 |
8 | 4 | 54.7 | 43.0 | 207.2 |
8 | 16 | 77.9 | 63.6 | 341.5 |
8 | 32 | 80.7 | 67.7 | 411.1 |
8 | 64 | 75.9 | 67.5 | 461.7 |
8 | 128 | 71.9 | 61.3 | 491.5 |
Hi,
in case anyone is surprised that 8 threads show no performance improvement over 4 threads: this is a limitation of the CPU, which has only 4 physical cores. Intel Hyper-Threading brings no speedup when all threads are doing floating-point work, since each core can execute only one floating-point operation at a time.
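If you want to verify this scaling behavior on your own machine, here is a small sketch (an illustration, not part of the benchmark above): pin the BLAS library to one thread, then run the same floating-point workload from 1, 2, 4, and 8 Python threads. NumPy releases the GIL inside BLAS calls, so aggregate throughput should roughly scale up to the physical core count and then flatten.

```python
import os
# Keep the BLAS library single-threaded so scaling comes only from our
# threads. (Must be set before importing numpy; works for OpenBLAS/MKL
# in most setups.)
os.environ["OMP_NUM_THREADS"] = "1"

import threading
import time
import numpy as np

def fp_work(n_matmuls=30, size=512):
    """A purely floating-point-bound task: repeated matrix multiplications."""
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    for _ in range(n_matmuls):
        a @ b

for n_threads in (1, 2, 4, 8):
    threads = [threading.Thread(target=fp_work) for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    # Total work grows with n_threads, so report aggregate throughput.
    print(f"{n_threads} threads: {n_threads / elapsed:.2f} tasks/s")
```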
Thank you for those numbers, very interesting!
I will reference a related GitHub issue here.
tl;dr:
I am constantly running out of memory on a GTX 1060, even with batch_size = 1 and beam_size = 4, using the en-de model.
Does anyone have any tips?
This thread is not related to OpenNMT-py. The issue you linked is the right place to discuss this specific problem.