I have a translation model that does a very good job of translating from Spanish to English. However, while translating real-world texts, I noticed that it does not handle emojis properly. In most cases, the translated text contains no emojis at all, and in the rare cases where it does, the emoji is wrong (mostly the same one each time). Since only a few sentences in my training dataset contain emojis, this makes sense.
However, I have the replace_unk option turned on. I was hoping that, with this option on, the emojis in the source sentence would receive the highest attention weight and be copied over to replace the unknown token.
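To make my expectation concrete, here is a minimal sketch of what I understand replace_unk to do. It is not the toolkit's actual code: the function name, the `<unk>` string, and the toy attention matrix are all assumptions made up for illustration.

```python
def replace_unk(source_tokens, predicted_tokens, attention, unk_token="<unk>"):
    """Replace each predicted unknown token with the source token that
    received the highest attention weight at that decoding step.

    attention[i][j] is the (hypothetical) weight that target step i
    places on source position j.
    """
    output = []
    for step, token in enumerate(predicted_tokens):
        if token == unk_token:
            weights = attention[step]
            # Index of the source position with the largest weight.
            best_src = max(range(len(weights)), key=weights.__getitem__)
            output.append(source_tokens[best_src])
        else:
            # Tokens other than the unknown token are left untouched.
            output.append(token)
    return output


# Toy example with invented numbers: the last target step attends
# mostly to the emoji in the source sentence.
source = ["me", "encanta", "😀"]
predicted = ["i", "love", "it", "<unk>"]
attention = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
result = replace_unk(source, predicted, attention)
```

Note that in this sketch a copy only happens at steps where the decoder actually emitted the unknown token, and only the single highest-weighted source token is copied.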
My question is two-fold:
- Why is the replace_unk option not working for emojis?
- What is the best way to handle emojis during translation?