Without pre-processing, NMT tries to translate every token from one language into another. However, many words/phrases should not be translated at all but simply copied from the source, especially named entities. In practice, it is annoying when such words/terms are translated incorrectly. I am aware that detecting these phrases before running NMT would be an option if the detection were perfect (which it is not). The Google and Microsoft translation APIs seem to handle this quite well. Does OpenNMT address this problem, or is there any advice? Thanks.
In OpenNMT-py there is a copy attention mechanism:
http://opennmt.net/OpenNMT-py/onmt.translation.html?highlight=copy%20attention
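For intuition, here is a toy numpy sketch of the idea behind copy attention (a pointer-generator style mixture of generating from the vocabulary and copying from the source). This is not OpenNMT-py code; the vocabulary, attention weights, and mixture weight are invented purely for illustration:

```python
import numpy as np

# Toy sketch of a copy mechanism (pointer-generator style).
# The final word distribution mixes the decoder's vocabulary softmax
# with the attention weights over the source tokens.

vocab = ["the", "city", "New", "Mexico", "<unk>"]
src_tokens = ["Truth", "New", "Mexico"]        # "Truth" is out-of-vocabulary

p_vocab = np.array([0.4, 0.3, 0.1, 0.1, 0.1])  # decoder softmax over vocab
attn    = np.array([0.6, 0.3, 0.1])            # attention over source positions
p_gen   = 0.3                                  # generate vs. copy mixture weight

p_final = p_gen * p_vocab
for tok, a in zip(src_tokens, attn):
    # Copy mass lands on the matching vocab entry, or on <unk> for
    # source tokens the model has never seen.
    idx = vocab.index(tok) if tok in vocab else vocab.index("<unk>")
    p_final[idx] += (1 - p_gen) * a

print(dict(zip(vocab, p_final.round(3))))
# "Truth" cannot be copied as itself: its copy probability collapses
# onto <unk>, which is why unseen words still need subword handling.
```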
Thanks. Even though the copying mechanism can identify words to copy, it is restricted to words seen in the training set, right? In other words, only if the NMT model has learned a copied word is it able to copy it during translation. For unknown words, the pre-processing tokenizes them into sub-words (assuming we are using a Transformer model), and the model still tries to translate those sub-words even when copy attention is enabled?
Thank you.
Even though the copying mechanism can identify words to copy, it is restricted to words seen in the training set, right?
Yes
For unknown words, the pre-processing tokenizes them into sub-words (assuming we are using a Transformer model), and the model still tries to translate those sub-words even when copy attention is enabled?
Yes
Some projects run a separate named entity recognizer on the input text. They substitute one of a few well-known markers for the entity, and then it's up to you to find the marker in the output and copy the original back in.
This is very helpful if the place name is, for example, “Truth or Consequences, New Mexico”.
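A minimal sketch of this marker approach, assuming a hand-made entity list and a `__NE0__`-style marker format (a real pipeline would use an NER model to find the entities):

```python
# Sketch of the marker-substitution approach described above; the
# entity list and marker format are assumptions for illustration.

KNOWN_ENTITIES = ["Truth or Consequences, New Mexico", "OpenNMT"]

def protect(text):
    """Replace each known entity with a marker before translation."""
    mapping = {}
    # Replace longer entities first so substrings don't clash.
    for i, ent in enumerate(sorted(KNOWN_ENTITIES, key=len, reverse=True)):
        marker = f"__NE{i}__"
        if ent in text:
            text = text.replace(ent, marker)
            mapping[marker] = ent
    return text, mapping

def restore(translation, mapping):
    """Copy the original entities back into the translated output."""
    for marker, ent in mapping.items():
        translation = translation.replace(marker, ent)
    return translation

src, mapping = protect("I visited Truth or Consequences, New Mexico last year.")
# ... run `src` through the NMT system here ...
fake_translation = src  # stand-in for the model output
print(restore(fake_translation, mapping))
```

Note that the markers have to survive tokenization intact (e.g. by protecting them from BPE splitting or adding them to the vocabulary), otherwise the model will try to translate the marker pieces just like any other sub-words.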