I want to know: is there any experiment comparing the Lua version and the C++ version? Does the C++ version run faster than Torch?
Here are the results of the benchmark I just ran on my desktop Intel i7. It compares tokens per second (higher is better) on the English-German model and the first 100 sentences of data/src-test.txt.
Threads | Batch size | Beam size | Lua (GPU*) | Lua (CPU) | Lua (CPU MKL) | CTranslate |
---|---|---|---|---|---|---|
1 | 30 | 5 | 646.8 | 66.9 | 76.6 | 69.1 |
1 | 30 | 1 | 535.1 | 103.6 | 151.4 | 202.1 |
1 | 1 | 5 | 209.0 | 22.7 | 38.1 | 33.7 |
1 | 1 | 1 | 166.9 | 22.7 | 54.7 | 43.2 |
2 | 30 | 5 | 646.8 | 90.6 | 106.3 | 96.3 |
2 | 30 | 1 | 535.1 | 125.4 | 213.4 | 310.8 |
2 | 1 | 5 | 209.0 | 24.0 | 52.3 | 51.2 |
2 | 1 | 1 | 166.9 | 23.9 | 66.1 | 70.0 |
4 | 30 | 5 | 646.8 | 104.0 | 128.7 | 116.2 |
4 | 30 | 1 | 535.1 | 128.5 | 207.0 | 392.7 |
4 | 1 | 5 | 209.0 | 24.1 | 52.8 | 62.2 |
4 | 1 | 1 | 166.9 | 23.3 | 61.1 | 84.9 |
Because matrix operations take most of the time, this mostly compares Torch+OpenBLAS against Eigen. Note that using Torch with Intel® MKL may completely change the results: previous experiments showed that it is faster with multiple threads.
* GTX 1080
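For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a timing harness, in Python for convenience. The command line shown is a placeholder based on translate.lua's usual flags, not the exact invocation used for the table above; substitute your own model, input file, and options.

```python
import subprocess
import time

def tokens_per_second(cmd, output_path):
    """Time a translation command and report throughput.

    `cmd` is a placeholder command line; substitute the real
    translate.lua or CTranslate invocation and its flags.
    """
    start = time.time()
    subprocess.run(cmd, check=True)
    elapsed = time.time() - start

    # Count whitespace-separated tokens in the translated output.
    with open(output_path) as f:
        n_tokens = sum(len(line.split()) for line in f)

    return n_tokens / elapsed

# Example (hypothetical model and file names):
tps = tokens_per_second(
    ["th", "translate.lua",
     "-model", "model.t7", "-src", "src-test-100.txt",
     "-output", "out.txt", "-batch_size", "30", "-beam_size", "5"],
    "out.txt",
)
print(f"{tps:.1f} tokens/s")
```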
Thanks very much! It seems that CTranslate doesn't speed things up much without multithreading.
Are these results CPU or GPU based?
CTranslate only supports CPU translation for now.
Would it be possible to add a GPU translate.lua column to the above table?
Here you go. I simply copy-pasted the GPU numbers for each thread count, as threading works differently for the GPU.
Cool. What GPU was that run on?
I’m seeing about an order of magnitude difference in the secs/sent reported on my laptop, which has an i7-6820HK 2.70GHz (x8) processor and a single GTX 1080.
Oh forgot to mention, it is a GTX 1080.
Have you (or anyone, for that matter) done any benchmarks on AWS? I’d be very curious to see EC2 performance numbers… Maybe comparing a c4.2xlarge to a p2.xlarge. Amazon uses K80 cards in their p-series instances, and the c4 has Xeon E5-2666 Haswell CPUs.
On-demand compute time pricing for the p2.xlarge is more than double the c4.2xlarge, so if you had a very large volume of data to translate, it would be interesting to know where the break-even point is…
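As a side note, with on-demand pricing there is no per-job fixed cost, so the comparison reduces to cost per translated token: whichever instance is cheaper per token wins at any volume, and there is no volume break-even unless you add fixed overheads (spin-up time, storage, etc.). Here is a small sketch of the arithmetic, with illustrative prices and throughputs (assumptions, not measured EC2 numbers):

```python
# Sketch of the cost comparison. All numbers below are illustrative
# assumptions, not measured EC2 prices or throughputs.
CPU_PRICE_PER_HOUR = 0.398   # hypothetical c4.2xlarge on-demand price ($/h)
GPU_PRICE_PER_HOUR = 0.900   # hypothetical p2.xlarge on-demand price ($/h)
CPU_TOKENS_PER_SEC = 220.0   # hypothetical CPU throughput
GPU_TOKENS_PER_SEC = 1000.0  # hypothetical GPU throughput

def cost_per_million_tokens(price_per_hour, tokens_per_sec):
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1e6

cpu_cost = cost_per_million_tokens(CPU_PRICE_PER_HOUR, CPU_TOKENS_PER_SEC)
gpu_cost = cost_per_million_tokens(GPU_PRICE_PER_HOUR, GPU_TOKENS_PER_SEC)
print(f"CPU: ${cpu_cost:.2f} per million tokens")
print(f"GPU: ${gpu_cost:.2f} per million tokens")

# With no per-job fixed cost, the GPU instance is cheaper at any volume
# as soon as its throughput advantage exceeds its price ratio:
print("GPU cheaper per token:",
      GPU_TOKENS_PER_SEC / CPU_TOKENS_PER_SEC
      > GPU_PRICE_PER_HOUR / CPU_PRICE_PER_HOUR)
```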
I have new numbers on a larger model (bidirectional LSTM 4x800, closer to “real-world” models). Again, I report tokens per second (higher is better).
- the CPU is a desktop Intel® i7 (4 physical cores)
- the GPU is a GTX 1080
- Torch is compiled with Intel® MKL
Beam 1
Threads | Batch size | CTranslate (CPU) | translate.lua (CPU) | translate.lua (GPU) |
---|---|---|---|---|
1 | 1 | 19.1 | 31.5 | |
1 | 4 | 36.9 | 43.6 | |
1 | 16 | 81.9 | 79.1 | |
1 | 32 | 101.1 | 96.3 | |
1 | 64 | 116.8 | 102.1 | |
1 | 128 | 123.1 | 105.8 | |
2 | 1 | 28.2 | 37.7 | |
2 | 4 | 57.4 | 56.4 | |
2 | 16 | 125.8 | 110.7 | |
2 | 32 | 156.6 | 131.4 | |
2 | 64 | 176.4 | 140.8 | |
2 | 128 | 179.9 | 148.5 | |
4 | 1 | 37.5 | 34.8 | |
4 | 4 | 77.1 | 65.5 | |
4 | 16 | 157.2 | 122.9 | |
4 | 32 | 200.2 | 156.7 | |
4 | 64 | 220.6 | 169.0 | |
4 | 128 | 220.9 | 183.5 | |
8 | 1 | 37.8 | 32.0 | 137.2 |
8 | 4 | 79.6 | 58.6 | 269.1 |
8 | 16 | 161.4 | 111.7 | 523.4 |
8 | 32 | 202.0 | 148.2 | 708.8 |
8 | 64 | 219.9 | 154.4 | 874.4 |
8 | 128 | 219.4 | 159.9 | 1024.6 |
Beam 5
Threads | Batch size | CTranslate (CPU) | translate.lua (CPU) | translate.lua (GPU) |
---|---|---|---|---|
1 | 1 | 13.9 | 17.4 | |
1 | 4 | 25.3 | 27.5 | |
1 | 16 | 42.6 | 37.7 | |
1 | 32 | 45.9 | 41.8 | |
1 | 64 | 46.8 | 42.5 | |
1 | 128 | 36.1 | 39.0 | |
2 | 1 | 22.3 | 23.0 | |
2 | 4 | 40.0 | 39.6 | |
2 | 16 | 63.5 | 55.5 | |
2 | 32 | 66.7 | 61.6 | |
2 | 64 | 63.9 | 61.9 | |
2 | 128 | 60.6 | 55.3 | |
4 | 1 | 29.6 | 26.5 | |
4 | 4 | 54.3 | 47.1 | |
4 | 16 | 77.5 | 69.5 | |
4 | 32 | 81.5 | 76.9 | |
4 | 64 | 74.6 | 76.9 | |
4 | 128 | 70.0 | 69.4 | |
8 | 1 | 31.3 | 24.5 | 121.4 |
8 | 4 | 54.7 | 43.0 | 207.2 |
8 | 16 | 77.9 | 63.6 | 341.5 |
8 | 32 | 80.7 | 67.7 | 411.1 |
8 | 64 | 75.9 | 67.5 | 461.7 |
8 | 128 | 71.9 | 61.3 | 491.5 |
Hi,
in case anyone is surprised that 8 threads show no performance improvement over 4 threads: this is a limitation of the CPU, which has only 4 physical cores. Intel Hyper-Threading brings no speedup when all threads are doing floating-point work, since each core can execute only one floating-point operation at a time.
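If you want to verify this scaling behavior on your own machine, here is a small sketch (an illustration, not part of the benchmark above): pin the BLAS library to one thread, then run the same floating-point workload from 1, 2, 4, and 8 Python threads. NumPy releases the GIL inside BLAS calls, so aggregate throughput should roughly scale up to the physical core count and then flatten.

```python
import os
# Keep the BLAS library single-threaded so scaling comes only from our
# threads. (Must be set before importing numpy; works for OpenBLAS/MKL
# in most setups.)
os.environ["OMP_NUM_THREADS"] = "1"

import threading
import time
import numpy as np

def fp_work(n_matmuls=30, size=512):
    """A purely floating-point-bound task: repeated matrix multiplications."""
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    for _ in range(n_matmuls):
        a @ b

for n_threads in (1, 2, 4, 8):
    threads = [threading.Thread(target=fp_work) for _ in range(n_threads)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    # Total work grows with n_threads, so report aggregate throughput.
    print(f"{n_threads} threads: {n_threads / elapsed:.2f} tasks/s")
```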
Thank you for those numbers, very interesting!
I will reference a related GitHub issue here.
tl;dr:
I am constantly running out of memory on a GTX 1060, even with batch_size = 1 and beam_size = 4, using the en-de model.
Does anyone have any tips?
This thread is not related to OpenNMT-py. The issue you linked is the right place to discuss this specific problem.