Copied words in MT

Without pre-processing, NMT tries to translate every token from one language into another. However, many words/phrases do not need to be translated and should simply be copied from the source, especially named entities. In practice, it is annoying when such words/terms are translated incorrectly. I am aware that detecting these phrases before NMT would be an option if the detection were perfect (which it is not). The Google and Microsoft translation APIs seem to handle this quite well. Does OpenNMT address this problem? Any advice? Thanks.

In OpenNMT-py there is a copy attention mechanism (the -copy_attn training option):

http://opennmt.net/OpenNMT-py/onmt.translation.html?highlight=copy%20attention

Thanks. Even though the copying mechanism can identify copied words, it is restricted to words seen in the training set, right? In other words, only if the NMT model has learned a copied word can it copy it during translation. For any unknown word, the pre-processing tokenizes it into sub-words (assuming we are using a Transformer model), and the model still tries to translate those sub-words even when copy attention is enabled?
Thank you.

Even though the copying mechanism can identify copied words, it is restricted to words seen in the training set, right?

Yes

For any unknown word, the pre-processing tokenizes it into sub-words (assuming we are using a Transformer model), and the model still tries to translate those sub-words even when copy attention is enabled?

Yes
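To illustrate that second point: sub-word segmentation typically leaves no <unk>, so an unseen name is simply broken into in-vocabulary pieces that the model then translates like any other tokens. Here is a toy sketch, using a greedy longest-match split over a made-up sub-word vocabulary (not the actual BPE/SentencePiece procedure):

```python
# Toy greedy longest-match segmenter over a made-up sub-word vocabulary.
# Real systems learn merges with BPE / SentencePiece; this only illustrates
# why an unseen name rarely stays intact: it is decomposed into known pieces,
# and the model then translates those pieces like ordinary tokens.
SUBWORD_VOCAB = {"Vil", "le", "ur", "ban", "ne"}  # assumed, for illustration

def segment(word, vocab):
    pieces, i = [], 0
    while i < len(word):
        # take the longest prefix of the remaining string that is in the vocab
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # character fallback for unknown symbols
            i += 1
    return pieces

print(segment("Villeurbanne", SUBWORD_VOCAB))
# ['Vil', 'le', 'ur', 'ban', 'ne'] -- each piece is in-vocabulary, so the
# model treats them as ordinary translatable tokens.
```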

Some projects run a separate named entity recognizer on the input text. They substitute one of a few well-known markers for the entity, and then it’s up to you to find the marker in the output and copy the original string back in.

This is very helpful if the place name is, for example, “Truth or Consequences, New Mexico”.
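To make that concrete, here is a minimal sketch of the mask-and-restore idea. The hard-coded entity list, the marker format, and the translation step are all placeholders; a real pipeline would detect entities with an actual NER model and make sure the markers are protected from sub-word splitting by the tokenizer.

```python
# Minimal mask-and-restore sketch. The entity list and the __ENTn__ marker
# format are assumptions for illustration; a real system would detect
# entities with an NER model and protect the markers during tokenization.
ENTITIES = ["Truth or Consequences, New Mexico"]

def mask_entities(text):
    """Replace known entities with numbered markers and remember the originals."""
    mapping = {}
    for i, ent in enumerate(ENTITIES):
        marker = "__ENT%d__" % i
        if ent in text:
            text = text.replace(ent, marker)
            mapping[marker] = ent
    return text, mapping

def unmask_entities(translated, mapping):
    """Copy the original entity strings back over the markers in the MT output."""
    for marker, original in mapping.items():
        translated = translated.replace(marker, original)
    return translated

src = "I drove to Truth or Consequences, New Mexico last summer."
masked, mapping = mask_entities(src)
print(masked)  # I drove to __ENT0__ last summer.

# ... run `masked` through the NMT system; assume the marker survives ...
hyp = "J'ai conduit jusqu'à __ENT0__ l'été dernier."
print(unmask_entities(hyp, mapping))
# J'ai conduit jusqu'à Truth or Consequences, New Mexico l'été dernier.
```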
