REST server throughput

I’m using the REST server and noticed that the throughput is ~2 translations/second. Since I’m running the server on a GPU, is there any way to accept a batch of inputs in the request, so the model can use matrix-matrix multiplications instead of the matrix-vector multiplications that are presumably happening now? That would increase the throughput.
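
To make the reasoning concrete, the gain comes from replacing many matrix-vector products with one matrix-matrix product. A minimal illustration, independent of OpenNMT (Python/NumPy):

    import time
    import numpy as np

    # One toy "layer": hidden size 512.
    W = np.random.rand(512, 512).astype(np.float32)
    inputs = np.random.rand(64, 512).astype(np.float32)  # 64 "sentences"

    # Unbatched: one matrix-vector product per input.
    start = time.time()
    for x in inputs:
        _ = W @ x
    loop_time = time.time() - start

    # Batched: a single matrix-matrix product for all inputs.
    start = time.time()
    _ = inputs @ W.T
    batch_time = time.time() - start

    print(f"looped mat-vec: {loop_time:.4f}s, batched mat-mat: {batch_time:.4f}s")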

As of this writing, the REST server is quite slow for me as well. I’ve had much better success with the 0mq server. You can adjust your batch size within the constraints of your model size and GPU memory available. For a medium-sized model, I’ve managed to get 1000 sentences through the 0mq server in under 35 seconds (GTX 1080).
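
In case it helps, this is roughly how I push a batch through the 0mq server from Python with pyzmq. It is only a sketch: the port and the JSON list-of-{"src": …} payload are assumptions from my setup, so check the translation server's options and source for the exact protocol of your version.

    import json
    import zmq

    # Sketch of a batched request to the ZeroMQ translation server.
    # The port and message format are assumptions; check the server's
    # source/documentation for the exact protocol of your version.
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect("tcp://127.0.0.1:5556")

    batch = [
        {"src": "Hello world !"},
        {"src": "How are you ?"},
        {"src": "This is a test ."},
    ]

    socket.send_string(json.dumps(batch))
    results = json.loads(socket.recv_string())
    for result in results:
        print(result)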

In theory the only difference between 0mq and REST is the tokenization and detokenization steps before and after.

I profiled it and the translation itself is what takes the most time.

I also saw about 0.2-0.5 sec per translation (on GPU, the same with beam 1 or 5).
On CPU it was much longer, and beam=1 is a requirement to make it workable.

Having said that, yes, a second REST function accepting a batch of sentences would be a must.


An additional method that takes a list of segments would be great! Or a single method that takes batches, leaving individual users the choice of a batch size of 1 should they so choose. :wink:
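
Something like this is what I am imagining for the batched call; the endpoint, port and payload shape below are purely illustrative, not the current API:

    import requests

    # Hypothetical batched call to the REST server; the URL and the
    # list-of-{"src": ...} payload are illustrative only.
    batch = [{"src": "Hello world !"}, {"src": "How are you ?"}]
    response = requests.post("http://127.0.0.1:7784/translator/translate", json=batch)
    for result in response.json():
        print(result)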

@vince62s are you saying that even if I send a batch of say 25, the 0MQ server version would perform 25 separate forward pass computations through the graph one after the other instead of doing a batched forward pass and decoding?

Maybe @jean.senellart or @guillaumekln might want to comment.

I am pretty sure the answer is yes, because I took most of the REST server code from it.

We will try to allow batches in the REST server quite soon, but PRs are welcome! :slight_smile:

EDIT: Guillaume is right! I had taken a shortcut because the tokenizer API does not take a batch …
So it still needs to be done, when time allows.

Actually, the ZeroMQ server supports true batch translation. There is a misleading comment in the code that suggests the opposite, but it actually builds a batched input if several source sentences are received.

Ok. Maybe we should update the documentation then. Can you point to the document if it’s on GitHub? I can send a PR with updated documentation.

Do you mean the static website? Here:

https://github.com/OpenNMT/opennmt.github.io

I’ll check in a PR soon with support for batches of sentences in the REST server.

Be patient. I think the doc is fine for 0mq.

@Wabbit
if you have time, can you please test the PR and let me know the speed results you get?
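
For example, a quick timing script along these lines would give comparable numbers; the URL and the payload shape are assumptions to adapt to whatever the PR exposes:

    import time
    import requests

    # Rough throughput check: sequential single-sentence requests vs.
    # one batched request. URL and payload shape are assumptions.
    URL = "http://127.0.0.1:7784/translator/translate"
    sentences = ["Hello world !"] * 100

    start = time.time()
    for s in sentences:
        requests.post(URL, json=[{"src": s}])
    print(f"sequential: {len(sentences) / (time.time() - start):.1f} sentences/sec")

    start = time.time()
    requests.post(URL, json=[{"src": s} for s in sentences])
    print(f"batched:    {len(sentences) / (time.time() - start):.1f} sentences/sec")
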
thanks.
V.

Sorry to jump in. I wanted to raise a new ticket, but this topic is related to mine.

I started using the REST server from the pilot release; see the article here:

I have the same feeling that the speed was slow with the previous version of the REST server: if I sent one sentence for translation, it would usually take a long time to respond.

I tested the latest version, and the speed seems much improved. Thanks for your great work.

But it seems the detokenization of the translation is missing now; please see my screenshots below for details. This is an important feature, I think, and it should be kept. Would you please check?

This is the expected output. You may want to use the -joiner_annotate flag, which generates the information needed to reattach punctuation symbols.
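
To make it concrete: with -joiner_annotate the tokenizer marks split points with the joiner character ￭, so the translation can be detokenized afterwards. A minimal post-processing sketch (assuming the default joiner) showing how those marks let you reattach punctuation:

    # Minimal detokenization sketch assuming the default joiner "￭" (U+FFED).
    # With -joiner_annotate, "Hello, world!" tokenizes roughly as
    # "Hello ￭, world ￭!", which this turns back into "Hello, world!".
    def detokenize(tokens: str, joiner: str = "\uffed") -> str:
        out = tokens.replace(" " + joiner + " ", "")  # e.g. "a ￭ b" -> "ab"
        out = out.replace(joiner + " ", "")           # e.g. "a￭ b" -> "ab"
        out = out.replace(" " + joiner, "")           # e.g. "a ￭b" -> "ab"
        return out

    print(detokenize("Hello \uffed, world \uffed!"))  # -> "Hello, world!"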


Thanks for the hint.