Best way to handle emojis during translation

I have a translation model that does a very good job of translating from Spanish to English. However, while translating real-world texts, I noticed that it was not handling emojis properly. In most cases, there were no emojis in the translated text, and in the few rare cases where there were, it was the wrong emoji (mostly the same one). As only a few of my training dataset sentences contain emojis, this makes sense.

However, I have the replace_unk option ON. I was hoping that with this option ON, an emoji in the source sentence would have the highest attention weight and would be copied in to replace the unknown token.

My question is two-fold:

  1. Why is the replace_unk option not working in the case of emoji?
  2. What is the best way to handle emojis during translation?

Thanks.

Hi,

What type of model are you using? If you’re using the Transformer, the -replace_unk option is not optimal. See Guillaume’s answer here.

We’re thinking of adding a ‘guided alignment’ feature (already implemented in OpenNMT-tf) to work around these limitations.

Hi @francoishernandez,
I am using the Transformer model, and the behavior I am seeing when using the -replace_unk option is consistent with @guillaumekln’s answer.

Do you know the time scale for if/when the ‘guided alignment’ feature will be added to OpenNMT-py?

Thanks.

Btw, @francoishernandez @guillaumekln, what happens if I use the -replace_unk option with the Transformer model? Some sort of copying mechanism seems to be at work, as I don’t get <unk> tokens. Do you know what is being copied?

Thanks!

It will still replace the <unk> target token with the source token that has the highest attention weight. However, Transformer attention weights usually cannot be used as a target-source alignment, so the selected source token is basically random.
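
In rough pseudocode, the mechanism is just an argmax over the attention row at each <unk> position (a sketch, not the actual OpenNMT-py implementation):

    # Rough sketch of the -replace_unk logic (not the actual OpenNMT-py code):
    # each <unk> in the prediction is replaced by the source token with the
    # highest attention weight at that decoding step.
    def replace_unk(pred_tokens, src_tokens, attn):
        """attn[i][j] = attention weight on src_tokens[j] at target step i."""
        out = []
        for i, tok in enumerate(pred_tokens):
            if tok == "<unk>":
                j = max(range(len(src_tokens)), key=lambda k: attn[i][k])
                out.append(src_tokens[j])
            else:
                out.append(tok)
        return out

With a Transformer, that argmax lands on an essentially arbitrary source position, which is why the copied token looks random.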

Thanks for the info. Yes, basically random is what I am seeing as well.

Hi @francoishernandez and @guillaumekln,

As there is a new feature based on the paper Jointly Learning to Align and Translate with Transformer Models, I was wondering: if I were to provide reference alignments during training (to invoke multi-task learning on translation and alignment) and set replace_unk to True during translation, would the unk now be replaced properly (and not randomly)? Is this the ‘guided alignment’ feature that @francoishernandez mentioned earlier?
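
(For reference, my understanding of the paper is that it adds a supervised alignment loss on one attention head, so the training objective becomes roughly L = L_translation + λ · L_alignment, where L_alignment is a cross-entropy term pushing that head’s distribution toward the reference alignments. That is my reading of the paper, not a claim about the exact OpenNMT-py implementation.)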

Regards

Yes you can definitely try that.
You can follow @Zenglinxiao’s guide: https://opennmt.net/OpenNMT-py/FAQ.html#can-i-get-word-alignment-while-translating .
Let us know how it goes!

Sure, will do. Thanks!

Hi @francoishernandez,
I tried the guided alignment, and the alignments I am getting make sense. However, setting replace_unk to True during translation still gives results similar to the ones I was getting without guided alignment. Do you know if anyone has tested replace_unk with guided alignment?

I can use the alignments I get during translation (with the report_align option) to replace <unk> with the aligned source token, but I wanted to check whether I am using replace_unk properly and am not missing anything.
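
For reference, the post-processing I have in mind looks roughly like this (just a sketch, assuming report_align emits Pharaoh-style srcIdx-tgtIdx pairs):

    def replace_unk_with_alignment(src_tokens, pred_tokens, align_str):
        # Map each target position to its aligned source position,
        # e.g. "0-0 1-1 3-2" -> {0: 0, 1: 1, 2: 3}.
        tgt2src = {}
        for pair in align_str.split():
            s, t = pair.split("-")
            tgt2src[int(t)] = int(s)
        # Copy the aligned source token over each <unk>.
        return [
            src_tokens[tgt2src[i]] if tok == "<unk>" and i in tgt2src else tok
            for i, tok in enumerate(pred_tokens)
        ]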

Thanks

Yes, you’re correct: -replace_unk has not been adapted to handle this case. It should not be that difficult, as it’s just a matter of taking the right attention matrix to do the mapping. Feel free to open a PR if you do it!

Hello @ArbinTimilsina,
I just learned that you want to use alignments to do --replace_unk, and it’s true that this has not been enabled before.
Therefore, I just opened PR #1731 for this. It should work, even though I haven’t tested it.
It would be great if you could test it and give us your feedback.

Hi @Zenglinxiao,
Awesome! I did a quick test and it seems to be working (with a model trained with supervised guided alignment).

For example:
without replace_unk, I got:

SENT 1: ['▁mi', '▁nombre', '▁es', '▁', 'ぜひお試し', '.', '▁fue', '▁un', '▁placer', '▁conocerte', '.']
PRED 1: ▁my ▁name ▁is ▁ <unk> . ▁it ▁was ▁a ▁pleasure ▁to ▁meet ▁you .

with replace_unk, I got:

SENT 1: ['▁mi', '▁nombre', '▁es', '▁', 'ぜひお試し', '.', '▁fue', '▁un', '▁placer', '▁conocerte', '.']
PRED 1: ▁my ▁name ▁is ▁ ぜひお試し . ▁it ▁was ▁a ▁pleasure ▁to ▁meet ▁you .

I have not done any detailed tests other than this, but to first order, it seems to be working.

I will let you know if I discover anything else in the future.

Thanks a lot for the PR!

Regards,
Arbin


Hi all,
I see that this PR has been merged. Would it be possible to release a new version of OpenNMT-py (which includes this change) to PyPI?

Thanks!

@francoishernandez, @pltrdy, I see that you are the maintainers of the OpenNMT-py PyPI package. Would it be possible to release a new version of OpenNMT-py that includes the latest changes?

Thanks.

Hi @ArbinTimilsina
Sure, we’ll release an updated version to PyPI.
EDIT: v1.0.2 uploaded

Also, please note that you can install from the git repo with pip if you want to try the most up-to-date code:
pip install git+https://github.com/OpenNMT/OpenNMT-py


Awesome! Thanks a lot @francoishernandez

Hi @Zenglinxiao @francoishernandez,
I am coming across the following error while translating a document. This happens only when translating with a model trained with guided alignment. Do you know what might be happening?

File "/home/atimilsina/Work/test-preprocessing-training/OpenNMT-py/onmt/translate/translator.py", line 352, in translate
    batch, data.src_vocabs, attn_debug
  File "/home/atimilsina/Work/test-preprocessing-training/OpenNMT-py/onmt/translate/translator.py", line 547, in translate_batch
    decode_strategy)
  File "/home/atimilsina/Work/test-preprocessing-training/OpenNMT-py/onmt/translate/translator.py", line 708, in _translate_batch_with_strategy
    batch, decode_strategy.predictions)
  File "/home/atimilsina/Work/test-preprocessing-training/OpenNMT-py/onmt/translate/translator.py", line 511, in _align_forward
    alignment_attn, prediction_mask, src_lengths, n_best)
  File "/home/atimilsina/Work/test-preprocessing-training/OpenNMT-py/onmt/utils/alignment.py", line 59, in extract_alignment
    .view(valid_tgt_len, -1)
RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] because the unspecified dimension size -1 can be any value and is ambiguous

Hello, I just noticed your issue with using guided alignment. This error occurs when there is no valid target token for a specific sentence in a batch. In the extract_alignment function, we feed the alignment attention head along with the batch’s padding information and get rid of all padding; the rest are valid, real tokens that should be considered to get the corresponding alignments.
In your case, this error suggests that one example in the batch has no valid tokens on the target side, in other words, a blank prediction, which is weird. Would you mind adding a try/except to get the original src and its prediction? That would help us figure out how to fix this issue.
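
To illustrate, the reshape error fires exactly when a batch element has zero valid target tokens; a plain PyTorch sketch of the failure:

    import torch

    # valid_tgt_len == 0 makes the -1 dimension ambiguous, hence the error.
    torch.empty(0).view(0, -1)
    # RuntimeError: cannot reshape tensor of 0 elements into shape [0, -1] ...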

@Zenglinxiao,
To verify your assumption, I changed the batch_size to 2 and added the following lines to the _translate_batch_with_strategy function of translator.py.

        results["predictions"] = decode_strategy.predictions
        results["attention"] = decode_strategy.attention

        flatten_prediction = [
            best.tolist() for bests in decode_strategy.predictions for best in bests
        ]
        if self.report_align:
            try:
                results["alignment"] = self._align_forward(
                    batch, decode_strategy.predictions)
            except RuntimeError:
                print("src:", src.tolist())
                print("predictions:", flatten_prediction)
        else:
            results["alignment"] = [[] for _ in range(batch_size)]
        return results

The output I got is:

src: [
[[13325], [5053]], [[3945], [84]], [[2023], [1]], [[4285], [1]], [[5618], [1]], [[20846], [1]], [[6506], [1]], [[12634], [1]], [[19949], [1]], [[31744], [1]], [[30032], [1]], [[14311], [1]], [[8845], [1]], [[26022], [1]], [[28139], [1]], [[14709], [1]], [[12405], [1]], [[12403], [1]], [[31857], [1]], [[4213], [1]], [[1006], [1]], [[18492], [1]], [[4918], [1]], [[5053], [1]], [[36], [1]], [[30828], [1]], [[14311], [1]], [[11690], [1]], [[26022], [1]], [[22876], [1]], [[13718], [1]], [[31744], [1]], [[29656], [1]], [[15038], [1]], [[30703], [1]], [[12287], [1]], [[12403], [1]], [[11072], [1]], [[31852], [1]], [[4852], [1]], [[31745], [1]], [[15721], [1]], [[2], [1]]
]

predictions: [
[29282, 29427, 16039, 27048, 7452, 23771, 29427, 9663, 8713, 5676, 4947, 1353, 17731, 1713, 11, 30663, 29427, 20410, 31845, 16115, 30650, 7452, 26942, 31845, 6594, 11350, 10892, 7452, 29423, 25, 4598, 18679, 4, 3], 
[3]
]

Your assumption seems to be correct: the error is the result of a blank prediction. I think this happens in a case like the following: the input text is a single character (like the replacement character �) that the SentencePiece encoder returns as an empty string, so the src text ends up containing just an empty string.
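
In the meantime, I am thinking of working around it by filtering out inputs that SentencePiece encodes to nothing, roughly like this ("spm.model" is a placeholder path; just a sketch):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="spm.model")  # placeholder path

    def safe_encode(line):
        # Return None for inputs that encode to an empty piece list,
        # so the caller can skip them instead of feeding a blank source.
        pieces = sp.encode(line, out_type=str)
        return pieces if pieces else None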

Any idea how to fix this blank prediction issue?

Thanks!

PS: For some reason, I don’t encounter the error when I set the batch_size to 1. Any idea why this would be the case?