Model Serving services

Hi guys,

I’m currently serving OpenNMT-tf models on our on-premise servers, but after several power outages over the last few months, we decided to move everything to the cloud. I’m currently looking at AWS SageMaker and GCP services.
So I’m just wondering: which services do you use for production model serving?


Dear Owen,

As you might already know, if you want to serve locally, you need to have a UPS (uninterruptible power supply).

For serving online, AWS SageMaker can be unnecessarily expensive. You can look into AWS EC2, Google Cloud Compute Engine, Azure Virtual Machines, DigitalOcean, etc.

You do not necessarily need a GPU for serving. Using CTranslate2 behind a well-written API on a capable CPU should be enough.
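To illustrate, a minimal batch-serving helper around CTranslate2’s Translator API might look like the sketch below. The real call would be `ctranslate2.Translator("model_dir", device="cpu").translate_batch(tokenized)`; here a stub stands in for the model so the snippet runs without one, and all names are hypothetical.

```python
# Illustrative sketch of CPU serving with CTranslate2's translate_batch().
# A stub replaces the real ctranslate2.Translator so this runs without a model.

def serve_batch(translator, tokenized_sentences):
    """Translate a batch of tokenized sentences and keep the best hypothesis."""
    results = translator.translate_batch(tokenized_sentences)
    # Each result carries ranked hypotheses; keep only the top one.
    return [result.hypotheses[0] for result in results]

# --- stub standing in for ctranslate2.Translator ---
class FakeResult:
    def __init__(self, tokens):
        self.hypotheses = [tokens]  # CTranslate2 returns ranked hypotheses

class FakeTranslator:
    def translate_batch(self, batch):
        # A real Translator decodes here; the stub just reverses the tokens.
        return [FakeResult(list(reversed(tokens))) for tokens in batch]

out = serve_batch(FakeTranslator(), [["hello", "world"], ["good", "morning"]])
print(out)  # [['world', 'hello'], ['morning', 'good']]
```

With a real model you would also tokenize with SentencePiece before the call and detokenize after, and you can tune `inter_threads`/`intra_threads` on the Translator for CPU throughput.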

Kind regards,

We have one of the best UPS devices; however, it doesn’t last five minutes when we have 8 GPUs running in one single machine.

The reason we must use GPUs is that CPU translation is extremely slow. We ran a benchmark translating 3,000 English sentences into French:
4 CPUs took 435 seconds (Intel Xeon Silver 4216)
1 GPU took 25.8 seconds (RTX 2080 Ti)
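For reference, a quick back-of-envelope on those numbers (taken from the benchmark above):

```python
# Throughput and speedup from the benchmark figures in this post.
sentences = 3000
cpu_seconds = 435.0   # 4 CPUs, Intel Xeon Silver 4216
gpu_seconds = 25.8    # 1 GPU, RTX 2080 Ti

cpu_rate = sentences / cpu_seconds   # ~6.9 sentences/s
gpu_rate = sentences / gpu_seconds   # ~116.3 sentences/s
speedup = cpu_seconds / gpu_seconds  # ~16.9x

print(f"CPU: {cpu_rate:.1f} sent/s, GPU: {gpu_rate:.1f} sent/s, "
      f"speedup: {speedup:.1f}x")
```

So the single GPU is roughly 17 times faster than the 4-CPU setup on this batch.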

This depends on your use case. For batch translation, yes, this is somewhat slow. However, if you translate sentence by sentence, e.g. in a CAT tool, it can still be reasonable, assuming of course you might need multiple CPUs.

Anyhow, all the services I mentioned above offer GPU options except DigitalOcean. I think Google is the cheapest; if you go for 1- or 3-year commitments, the discounted prices can be comparable.

Other cheaper services include: Linode, LambdaLabs, Paperspace Core, and Thoplam.

I hope this helps.

Kind regards,

Thank you for all the info. Yes, GCP seems to offer the lowest price. I will consult with our senior managers, as GCP only offers these services in non-Canada regions; if data transmission to a US region is fine, then I will go with GCP. AWS SageMaker is indeed crazy expensive: I did an estimate and it comes to close to $3,000 USD per month using their cheapest instance with just a Tesla T4 GPU.

We need the performance for batch translations. We do have our own CAT tool (not Memsource or SDL) implemented for our web app, but most of our customers don’t use it, as they are not translators; they just expect their documents to be translated as fast as possible, so performance is a big concern here.


Good luck! Linode seems to have data centres in Toronto with specific security options.


Were you using TensorFlow or CTranslate2 to produce these numbers?


Hello Yasmin,

Out of curiosity, for what reasons should someone prefer Google over Linode or some of the other cheaper services?

Is there any additional benefit with Google/AWS or big players like that?

Best regards,

Hi Samuel!

  • Some highly regulated clients require security certificates and legal conditions that only big players can afford.

  • These services are not necessarily more expensive in all cases. The GPU market is yet to evolve, so companies calculate the eventual cost of services for everything (not only GPUs), and might prefer to stick with one stable provider for seamless operations (implementation, integration, maintenance, etc.)

  • For research and non-profit organizations, the big players are the ones that offer relatively generous grants to help you get started. If you check their websites or contact their support, you can get more details about such offers.

Kind regards,


Hi Owen!

I see now they have GPU options in Montreal and Toronto, mainly NVIDIA Tesla P4.

Kind regards,


Good point. To kick off my Tagalog-English project last year, I got a $200 grant/credit from Microsoft Azure to use their SMT Eng-Tag model to create some “synthetic source” data.