OpenNMT Forum

Best way to handle emojis during translation

I have a translation model that does a very good job of translating from Spanish to English. However, when translating real-world texts, I noticed that it was not handling emojis properly. In most cases there were no emojis in the translated text, and in the few rare cases where one appeared, it was the wrong emoji (mostly the same one). Since only a few sentences in my training dataset contain emojis, this makes sense.

However, I have the replace_unk option ON. I was hoping that with this option ON, the emojis in the source sentence would have the highest attention weight and would be copied to replace the unknown token.

My question is two-fold:

  1. Why is the replace_unk option not working in the case of emojis?
  2. What is the best way to handle emojis during translation?

Thanks.

Hi,

What type of model are you using? If you’re using a Transformer, the -replace_unk option is not optimal. See Guillaume’s answer here.

We’re thinking of adding a ‘guided alignment’ feature (already implemented in OpenNMT-tf) to work around these limitations.

Hi @francoishernandez,
I am using the Transformer model, and the behavior I am seeing with the -replace_unk option is consistent with @guillaumekln’s answer.

Do you know the time scale for if/when the ‘guided alignment’ feature will be added to OpenNMT-py?

Thanks.

Btw, @francoishernandez @guillaumekln, what happens if I use the -replace_unk option with the Transformer model? It seems like some sort of copying mechanism is at work, as I don’t get <unk> tokens. Do you know what is being copied?

Thanks!

It will still replace the <unk> target token with the source token that has the highest attention weight. However, Transformer attention weights usually cannot be used as a target-source alignment, so the selected source token is basically random.
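
To make that concrete, here is a minimal sketch of what an attention-based <unk> replacement does. This is not the actual OpenNMT-py code; the function name, the list-of-lists attention layout, and the toy weights are just illustrative:

```python
def replace_unk(src_tokens, tgt_tokens, attention):
    """Replace each <unk> target token with the source token that
    receives the highest attention weight at that target step.
    attention[i][j] is the (illustrative) weight that target step i
    puts on source token j."""
    output = []
    for i, tok in enumerate(tgt_tokens):
        if tok == "<unk>":
            # argmax over the attention row for this target step
            j = max(range(len(src_tokens)), key=lambda k: attention[i][k])
            output.append(src_tokens[j])
        else:
            output.append(tok)
    return output

# With an RNN attention model, the row for an <unk> often peaks on the
# corresponding source token (e.g. the emoji), so it gets copied.
# With a Transformer, the attention weights are not a reliable
# target-source alignment, so the argmax lands on an arbitrary token.
src = ["hola", "😀", "amigo"]
tgt = ["hello", "<unk>", "friend"]
attn = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],  # peaks on the emoji -> emoji is copied
    [0.1, 0.1, 0.8],
]
print(replace_unk(src, tgt, attn))  # ['hello', '😀', 'friend']
```

The mechanism always picks *some* source token for each <unk>; whether that pick is meaningful depends entirely on whether the attention weights behave like an alignment, which is why it works reasonably with RNN attention but not with the Transformer.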

Thanks for the info. Yes, basically random is what I am seeing as well.