Basic Overview of Using NMT as Translation Engine


I’d like to try to integrate an open-source translation engine into a CAT tool, for Swedish to English translation.

I was wondering, first of all, how does one gather the data on which to train OpenNMT? Can it perform comparably to Google Translate or Facebook's translation engine? Why or why not? Are their learning algorithms fundamentally better for some reason, such as industry secrets or more computing power? And what about the data, the corpora, or the web crawlers they use? Is it possible for an individual to set up a system just as good as theirs, and if so, how?

More broadly, could a machine translation system ever be as comprehensive as a state-of-the-art dictionary? For example, if we fed it the most exhaustive corpus imaginable, could we hope it would provide effective, encyclopedic translation suggestions for a wide variety of obscure terms and expressions? In other words, could the system begin to compete with the best dictionaries in coverage and accuracy, or even surpass them?

Lastly, I'm also wondering: is there an exhaustive list, or a search keyword, for computational systems that provide any kind of word reference? This could be a list of synonyms, a list of translations, or any kind of semantic content or analysis that in effect provides a "definition", or clarifies roughly what a word means. I ask simply to know, as a translator, what tools are out there beyond dictionaries and machine translation.

Thanks very much.



You asked a lot of questions, so I will answer briefly. Others are welcome to expand if they know more.

There are many data sources that are freely available online. For example, you can check the OPUS collection of parallel corpora or the WMT shared task datasets.

All machine translation providers use similar algorithms. Most of the differences come from the quantity, quality, and preparation of the data. Computing power also plays a role, but strictly speaking it does not determine translation quality.
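To illustrate what "preparation of the data" means concretely, here is a minimal sketch of one common cleaning step: filtering a parallel corpus by segment length and source/target length ratio to drop likely misalignments before training. The function name and thresholds are illustrative, not from any particular toolkit.

```python
def filter_parallel_corpus(src_lines, tgt_lines,
                           min_len=1, max_len=100, max_ratio=3.0):
    """Keep only sentence pairs whose token lengths look plausible.

    Drops empty or overly long segments, and pairs whose source/target
    length ratio suggests a misaligned segment.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Reject segments that are empty or too long on either side.
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Reject pairs whose lengths differ too much (likely misalignment).
        if max(n_src, n_tgt) / max(min(n_src, n_tgt), 1) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept

# Toy Swedish-English data: one good pair, one misaligned pair, one empty source.
src = ["Jag äter ett äpple .", "Hej !", ""]
tgt = ["I am eating an apple .",
       "Hello , nice to meet you , how are you today ?",
       "Empty source ."]
pairs = filter_parallel_corpus(src, tgt)  # only the first pair survives
```

Real pipelines usually add deduplication, language identification, and subword tokenisation on top of this, but the principle is the same: the quality of what goes in largely determines what comes out.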

In general it's possible to train models that match these companies' systems in specific cases, but it is difficult to train models that are as robust and versatile. That requires a lot of work.

Not really. An NMT system is essentially a smart, lossy compression of its training data: it manages to fit several gigabytes of knowledge into a few hundred megabytes. So it cannot be as comprehensive and precise as a full dictionary.



Here are my two cents.

First of all, it is quite difficult to answer most of these questions without knowing, at the very least, the use case (gisting, post-editing, support…), the context (freelance translation, non-translation-related business…), the end user(s) (translators, customers…), and the domain (general or domain-specific…).

But generally speaking, if the texts being translated don't belong to a specific domain and there are no particular environment constraints (such as running the engines offline for security reasons), I think proprietary and commercial MT solutions are the most practical option, as one can get good overall quality for a fraction of the cost of building an engine from scratch.

On the other hand, building a custom engine may be interesting if the output must be adapted to a specific domain in terms of terminology, style, etc. The degree of success of this adaptation usually depends more on the data used to train the models than on the architecture or algorithms themselves. So, if this is your case, you might want to look at solutions that let you build engines with your own dataset without getting into all the implementation details.

Finally, if the question is whether it is doable to build an engine with an open-source toolkit that produces acceptable quality for a given purpose, the answer is yes. In my opinion, OpenNMT is an excellent solution for this, for many reasons (customisation, domain-adaptation features, performance, community, documentation, etc.). Whether it is reasonable to do so depends on the use case, the resources available, the technical expertise, etc.
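To give a rough idea of what getting started looks like in practice, here is a minimal, hypothetical OpenNMT-py configuration sketch. All file paths and step counts are placeholders, and option names can change between versions, so treat this as an outline and check the current OpenNMT-py documentation rather than copying it verbatim.

```yaml
# config.yaml - minimal sketch; all paths below are placeholders.
# Typical OpenNMT-py workflow (command names as of recent versions):
#   onmt_build_vocab -config config.yaml -n_sample 10000
#   onmt_train -config config.yaml
#   onmt_translate -model run/model_step_100000.pt -src test.sv -output pred.en
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
data:
    corpus_1:              # your Swedish-English training data
        path_src: data/train.sv
        path_tgt: data/train.en
    valid:                 # held-out validation set
        path_src: data/valid.sv
        path_tgt: data/valid.en
save_model: run/model
train_steps: 100000
valid_steps: 10000
```

The configuration is deliberately small here; in a real project you would also choose a model architecture, subword tokenisation, and GPU settings, which is where most of the effort mentioned above goes.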