Our company is trying to build its own corporate machine translation system. Currently, we are considering OpenNMT as a foundation. It should be mentioned that we have no previous experience in machine translation whatsoever, so I beg your pardon if my questions seem naive, but I assume this is the right place to enquire:
Is it worthwhile for a company to build its own machine translation service in order to reduce the long-term cost of translating its content, rather than paying for other services such as Google Translate? Or is it so expensive and energy-intensive that we would need a decade to build something that even slightly resembles a decent machine translator?
How complex and resource-intensive is running the machine translation process itself? (servers, expenses, etc.)
To avoid developing the translator from scratch, are there perhaps some pretrained models we could use (English, Spanish, Russian)? If so, where can they be found?
Nowadays it is relatively easy to train and deploy an NMT model with good quality; I would say it could be a matter of months. However, established machine translation providers have already put a lot of effort into figuring out many details, from data preparation to the serving infrastructure.
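To give a feel for what "training a model" involves, here is a minimal OpenNMT-py training configuration sketch. All file paths, corpus names, and step counts are placeholder assumptions, not a recommended setup; check the OpenNMT-py documentation for the full list of options:

```yaml
# Minimal OpenNMT-py training config sketch -- paths and values are placeholders
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt

data:
  corpus_1:
    path_src: data/train.src    # tokenized source-language training text
    path_tgt: data/train.tgt    # tokenized target-language training text
  valid:
    path_src: data/valid.src
    path_tgt: data/valid.tgt

save_model: run/model
world_size: 1        # single machine
gpu_ranks: [0]       # one GPU
train_steps: 100000
valid_steps: 10000
```

You would typically build the vocabularies first with `onmt_build_vocab -config config.yaml` and then launch training with `onmt_train -config config.yaml`.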
It really depends on what service you want to build, how many language pairs you want to offer, and how you plan to scale. Training models will certainly require GPU servers (a single model could require a week of training on a single GPU). Then, in production, you would still need compute-oriented servers (a good CPU, 8 GB+ of RAM, and ideally a GPU to maximize performance).
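A rough back-of-envelope sketch of the training budget this implies. The pair list, per-model training time, and GPU count below are illustrative assumptions (one model per translation direction, pivoting Russian<>Spanish through English), not benchmarks:

```python
from math import ceil

# Assumption: one model per direction, Russian<->Spanish pivoted via English.
language_pairs = [("en", "ru"), ("ru", "en"),
                  ("en", "es"), ("es", "en")]
weeks_per_model_per_gpu = 1.0  # assumed, per the "week on a single GPU" figure
num_gpus = 2                   # assumed hardware budget

total_gpu_weeks = len(language_pairs) * weeks_per_model_per_gpu
wall_clock_weeks = ceil(total_gpu_weeks / num_gpus)

print(f"{len(language_pairs)} models -> {total_gpu_weeks:.0f} GPU-weeks")
print(f"With {num_gpus} GPUs: ~{wall_clock_weeks} weeks per full training round")
# -> 4 models -> 4 GPU-weeks
# -> With 2 GPUs: ~2 weeks per full training round
```

Note this counts only one training run per model; in practice you will retrain several times while tuning data and hyperparameters.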
There are some pretrained models available on the OpenNMT website, but except for the English-German models, they are not really production-ready:
Hi, as somebody who has built useful OpenNMT systems for clients, I can say that the most important thing is to have somebody in your organisation who takes ownership of the project. If you are looking for “on-premise” solutions, you could buy two GPU-fitted machines for under $6,000. But that’s the easy bit! You will then need to acquire and clean appropriate data, then train and test your models - that’s where the heavy lifting starts. I’d say you’d probably need around three months from a zero baseline to reach the stage where one of your models gives you acceptable translations. And - as the site is totally free - I can mention www.nmtgateway.com, which shows baseline models for Dutch<>English, Indonesian<>English and Turkish<>English.
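That data "heavy lifting" usually starts with simple heuristic filters over the parallel corpus. A minimal sketch of one such filter; the thresholds and the `keep_pair` helper are illustrative choices, not a standard API:

```python
def keep_pair(src: str, tgt: str,
              min_tokens: int = 1, max_tokens: int = 100,
              max_ratio: float = 3.0) -> bool:
    """Heuristic filter for one parallel sentence pair.

    Drops empty, overlong, or badly length-mismatched pairs --
    a common first cleaning pass before NMT training.
    """
    s, t = src.split(), tgt.split()
    if not (min_tokens <= len(s) <= max_tokens):
        return False
    if not (min_tokens <= len(t) <= max_tokens):
        return False
    # Length-ratio check: wildly different lengths suggest a bad alignment.
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio

pairs = [
    ("The harvest starts in May.", "La cosecha comienza en mayo."),
    ("Irrigation schedule", ""),  # empty target: dropped
    ("Soil", "El suelo de la granja necesita mucho mas analisis detallado"),  # mismatch: dropped
]
cleaned = [p for p in pairs if keep_pair(*p)]
print(len(cleaned))  # -> 1
```

Real pipelines add more passes (deduplication, language identification, encoding fixes), but this is the shape of the work.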
I hope you don’t mind if I get a little more specific.
You’ve said we are going to need both GPU and compute-oriented servers. Let’s say we plan to start with three languages: English, Russian and Spanish. The main subject matter of the content is agriculture.
The question, then, is how many GPU and compute-oriented servers we would need to achieve a decent result within six months. Secondly, if we consider a real-time translation mode, as in the case of Systran, with a request stream of 1 per second, would we need 3 servers or 3,000?
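For what it's worth, at 1 request/second the sizing arithmetic comes out much closer to 3 than 3,000. A sketch using Little's law; the latency and per-server concurrency figures are assumptions you would replace with your own measurements:

```python
from math import ceil

# All numbers below are illustrative assumptions, not benchmarks.
requests_per_second = 1.0
seconds_per_request = 0.5   # assumed average translation latency on one worker
workers_per_server = 4      # assumed concurrent translation workers per machine

# Little's law: average requests in flight = arrival rate * latency.
concurrent = requests_per_second * seconds_per_request
servers_needed = ceil(concurrent / workers_per_server)
print(servers_needed)  # -> 1 (plus headroom/redundancy in practice)
```

Even with pessimistic latency you would provision a small handful of machines for headroom and failover, not thousands.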
Perhaps we could shortcut the whole thing and buy a turnkey solution, or hire a professional who could provide thorough consultation to guide us through the process. We would appreciate it if you could recommend either of those options.