Is there any way to keep placeholders the same as in the source when calling NMT?


(liluhao1982) #1

Hi,

In some situations, we do NOT want certain words (e.g. placeholders) to be translated.
For example, the source segment below contains several placeholders ({9990}, {9991}, …):
src=Sign in with a {■ 9990 ■}■ Microsoft ■{■ 9991 ■} Account Sign in with {■ 9992 ■}■ Facebook ■{■ 9993 ■} Sign in with Twitter
But in the translation from the NMT engine, almost all placeholders are missing:
tgt=GM con un autógrafo 9990}Microsoft Microsoft}}} en con Michael 9992}Facebook Facebook}}}} en con EUB

When I enter the same source segment into Microsoft Bing Translator, the placeholders are kept the same as in the source and appear in the proper positions.

Is there any way to keep such placeholders identical to the source, and in the proper positions in the translation, when calling NMT?

Thanks.


(jean.senellart) #2

Hi @liluhao1982 - what you get is what you trained your model to do. So if you want your model to preserve your placeholders, you do need some special training.

What I would suggest is that you keep each placeholder as one single token, and do not number them with an increasing count - otherwise your model will have to learn how to deal with each of them separately. For instance (or whatever).
Then you need to make sure you have enough (possibly artificial) examples in your training corpus so that the model learns to deal with them correctly.
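
As a rough illustration of the suggestion above, here is a minimal Python sketch of one way to do it outside the model: replace every numbered placeholder with a single generic token before training and translation, then put the original placeholders back afterwards. The token name `__PH__`, the `{digits}` pattern and both function names are assumptions made for this example, not part of any NMT toolkit. It also assumes the tokenizer keeps `__PH__` as one token and that the model outputs the tokens in the same order as the source, which may not hold for every language pair.

```python
import re

# Hypothetical pattern for placeholders such as {9990}; adjust to your data.
PLACEHOLDER = re.compile(r"\{\d+\}")
# Hypothetical generic token; the tokenizer must not split it.
MASK = "__PH__"


def mask_placeholders(text):
    """Replace each numbered placeholder with the single generic token.

    Returns the masked text and the original placeholders in order.
    """
    originals = PLACEHOLDER.findall(text)
    masked = PLACEHOLDER.sub(MASK, text)
    return masked, originals


def unmask_placeholders(text, originals):
    """Restore the original placeholders, left to right.

    Assumes the translation keeps the generic tokens in source order.
    Any extra generic token is left untouched.
    """
    remaining = iter(originals)
    return re.sub(
        re.escape(MASK),
        lambda match: next(remaining, match.group(0)),
        text,
    )


if __name__ == "__main__":
    src = "Sign in with a {9990}Microsoft{9991} Account"
    masked, originals = mask_placeholders(src)
    print(masked)
    # The masked text would be sent to the model; a made-up output is used here.
    fake_translation = "Iniciar sesión con una cuenta de __PH__Microsoft__PH__"
    print(unmask_placeholders(fake_translation, originals))
```

The same masking would be applied to both sides of the training corpus, so the model only ever sees the one generic token and can learn to copy it.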

