Best way to handle emojis during translation

Hello, I just noticed your issue about using guided alignment. This error occurs when there is no valid target token for a specific sentence in a batch. In the extract_alignment function, we feed the alignment attention head with the batch's padding information and strip out all padding; the remaining valid, real tokens are the ones used to extract the corresponding alignments.
In your case, this error suggests that one example in the batch has no valid tokens on the target side, in other words a blank prediction, which is odd. Would you mind adding a try/except to capture the original src and its prediction? That would help us figure out how to fix this issue.

@Zenglinxiao,
To verify your assumption, I changed the batch_size to 2 and added the following lines to the _translate_batch_with_strategy function of translator.py.

        results["predictions"] = decode_strategy.predictions
        results["attention"] = decode_strategy.attention

        flatten_prediction = [
            best.tolist() for bests in decode_strategy.predictions for best in bests
        ]
        if self.report_align:
            try:
                results["alignment"] = self._align_forward(
                    batch, decode_strategy.predictions)
            except RuntimeError:
                print("src:", src.tolist())
                print("predictions:", flatten_prediction)
        else:
            results["alignment"] = [[] for _ in range(batch_size)]
        return results

The output I got is:

src: [
[[13325], [5053]], [[3945], [84]], [[2023], [1]], [[4285], [1]], [[5618], [1]], [[20846], [1]], [[6506], [1]], [[12634], [1]], [[19949], [1]], [[31744], [1]], [[30032], [1]], [[14311], [1]], [[8845], [1]], [[26022], [1]], [[28139], [1]], [[14709], [1]], [[12405], [1]], [[12403], [1]], [[31857], [1]], [[4213], [1]], [[1006], [1]], [[18492], [1]], [[4918], [1]], [[5053], [1]], [[36], [1]], [[30828], [1]], [[14311], [1]], [[11690], [1]], [[26022], [1]], [[22876], [1]], [[13718], [1]], [[31744], [1]], [[29656], [1]], [[15038], [1]], [[30703], [1]], [[12287], [1]], [[12403], [1]], [[11072], [1]], [[31852], [1]], [[4852], [1]], [[31745], [1]], [[15721], [1]], [[2], [1]]
]

predictions: [
[29282, 29427, 16039, 27048, 7452, 23771, 29427, 9663, 8713, 5676, 4947, 1353, 17731, 1713, 11, 30663, 29427, 20410, 31845, 16115, 30650, 7452, 26942, 31845, 6594, 11350, 10892, 7452, 29423, 25, 4598, 18679, 4, 3], 
[3]
]

Your assumption seems to be correct: the error is the result of a blank prediction. I think it happens in cases like this: the input text is a single character (like the replacement character �) that the SentencePiece encoder turns into an empty string, so the src text ends up containing just an empty string.
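One way to check this hypothesis (a minimal sketch, assuming the sentencepiece package and access to the model used during preprocessing; the model path is a placeholder):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.load("sentencepiece.model")  # placeholder: the model used for preprocessing
    for text in ["a normal sentence", "\ufffd"]:  # "\ufffd" is the replacement character
        pieces = sp.encode_as_pieces(text)
        print(repr(text), "->", pieces, "(empty encoding!)" if not pieces else "")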

Any idea how to fix this blank prediction issue?

Thanks!

P.S.: For some reason, I don't encounter the error when I set the batch_size to 1. Any idea why that would be the case?

Hi @Zenglinxiao,
I could filter out these edge cases on my end, but having a fix on OpenNMT-py's side would benefit everyone else. Just wondering if you have had a chance to think about how to fix this blank prediction issue. Let me know if I can be of any help.
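For reference, the kind of filtering I have in mind on my side is roughly this (a rough sketch with a hypothetical helper, reusing the SentencePiece processor from above):

    def keep_non_empty(src_lines, sp):
        """Split source lines into translatable lines and the indices of lines
        whose sub-word encoding is empty, so they can be skipped and re-merged."""
        kept, skipped = [], []
        for i, line in enumerate(src_lines):
            if sp.encode_as_pieces(line):
                kept.append(line)
            else:
                skipped.append(i)
        return kept, skipped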

Thanks!

Hello @ArbinTimilsina,
A quick fix for this could be:

def extract_alignment(align_matrix, tgt_mask, src_lens, n_best):
...
        valid_tgt_len = valid_tgt.sum()
        if valid_tgt_len == 0:
            valid_alignment = None
        else:
            # get valid alignment (sub-matrix of the full padded alignment matrix)
            am_valid_tgt = am_b.masked_select(valid_tgt.unsqueeze(-1)) \
                               .view(valid_tgt_len, -1)
            valid_alignment = am_valid_tgt[:, :src_len]  # only keep valid src
        alignments[i // n_best].append(valid_alignment)
...

But I'm not sure why batch_size of 1 does not trigger the blank prediction issue. In theory, when there is no valid tgt token, the .view(0, -1) call will always raise an error. Maybe there just isn't a blank prediction when batch_size == 1?
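For illustration, a standalone snippet (not OpenNMT code) that reproduces the error raised by .view(0, -1) when there are no valid target tokens:

    import torch

    am_b = torch.rand(3, 5)                              # toy alignment matrix for one example
    valid_tgt = torch.zeros(3, dtype=torch.bool)         # no valid target token at all
    flat = am_b.masked_select(valid_tgt.unsqueeze(-1))   # -> empty 1-D tensor
    try:
        flat.view(int(valid_tgt.sum()), -1)              # .view(0, -1)
    except RuntimeError as err:
        print("RuntimeError:", err)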
Do you have any idea @francoishernandez?

Could this be linked to #1320 and #1547?

@Zenglinxiao,
I tried your recommendation but now I am getting:

File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/translate/translator.py", line 368, in translate
    in trans.word_aligns[:self.n_best]]
  File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/translate/translator.py", line 367, in <listcomp>
    align_pharaohs = [build_align_pharaoh(align) for align
  File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/utils/alignment.py", line 72, in build_align_pharaoh
    tgt_align_src_id = valid_alignment.argmax(dim=-1)
AttributeError: 'NoneType' object has no attribute 'argmax'

Any idea what would be appropriate for valid_alignment rather than None?

@ArbinTimilsina, thanks for the feedback.
I don't think we can find a default tensor to use in place of None; in this case there really is no alignment, because the target is empty.
To solve the build_align_pharaoh problem, we just need to do the same thing:

import torch


def build_align_pharaoh(valid_alignment):
    align_pairs = []
    if isinstance(valid_alignment, torch.Tensor):
        tgt_align_src_id = valid_alignment.argmax(dim=-1)

        for tgt_id, src_id in enumerate(tgt_align_src_id.tolist()):
            align_pairs.append(str(src_id) + "-" + str(tgt_id))
        align_pairs.sort(key=lambda x: int(x.split('-')[-1]))  # sort by tgt_id
        align_pairs.sort(key=lambda x: int(x.split('-')[0]))  # sort by src_id
    return align_pairs
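For example (a quick sanity check, not part of the PR):

    print(build_align_pharaoh(torch.tensor([[0.9, 0.1], [0.2, 0.8]])))  # ['0-0', '1-1']
    print(build_align_pharaoh(None))                                    # [] for an empty target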

You can check my last PR to see if this helps.

Hi @Zenglinxiao,
I just checked out your branch and my test code ran without any error, so I think it is safe to say that this PR has fixed the problem.

I will link this comment to the PR so that whoever is reviewing it will know about this.

Thanks for the help.

Regards,
Arbin

Hi @francoishernandez,
I see that this PR has been merged. Will it be possible to release a new version of OpenNMT-py (which includes this change) in pip?

Thanks!

I’d like to finish up a few things and then bump to 1.1.0. Probably by the end of the week.
In the meantime you can pip install git+https://github.com/OpenNMT/OpenNMT-py.

Ok, thanks.

@francoishernandez,
I just noticed that the version has been bumped to 1.1.0. I ran my test code with it and now I am getting the following error:

   batch_size=self.batch_size
  File "/home/atimilsina/daylight-venv/lib/python3.6/site-packages/onmt/translate/translator.py", line 319, in translate
    self.fields.pop('corpus_id')
KeyError: 'corpus_id'

Any idea why that is?

Thanks!

Yes, the 'corpus_id' field was added in #1732. That line should be made conditional to retain compatibility with previously trained models. I'll release a fixed 1.1.1 version.
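Something along these lines (a sketch of the kind of guard meant here, not the exact patch):

    if 'corpus_id' in self.fields:
        self.fields.pop('corpus_id')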
Thanks for reporting!

Thanks! It works now.

Hi @Zenglinxiao, @francoishernandez,
I came across the following error while running evaluation on one of my test datasets. I am running OpenNMT-py 1.1.1 and using a transformer model trained with guided alignment. For translation, I have replace_unk and report_align set to True. Any idea what this error is and how to fix it:

  File ".../onmt/translate/translation.py", line 51, in _build_target_tokens
    _, max_index = attn[i][:len(src_raw)].max(0)
IndexError: index 12 is out of bounds for dimension 0 with size 7

Thanks!

Hi @Zenglinxiao, @francoishernandez,
If it helps, I figured out that this error happens only during evaluation when I pass the -tgt file to translator.translate(). If I don't pass the -tgt file, I don't come across this error.

Also, after some digging, I found that it happens, in my case, for the following source-target combination.

Source: ▁sub ▁v f ▁: ▁né mon e , ▁dr oo , ▁ve en , ▁ ¤ aka ¤
Target: ▁vo ▁by ▁: ▁ ¤ aka ¤

Any idea why this is causing the error and how to fix it?

Thanks,
Arbin

For reference, this was solved here.

Hi @Zenglinxiao, @francoishernandez,
I was curious whether you have thought about, or know of, any tools/algorithms to convert sub-word level alignments to word- or phrase-level alignments.

Regards,
Arbin

Section 5.1 of this paper is probably what you’re looking for: https://www.aclweb.org/anthology/D19-1453.pdf
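If it helps, a common approach (roughly what that section describes; this is my own rough sketch, not code from the paper or OpenNMT-py) is to map each sub-word to its word index using the SentencePiece "▁" markers, then align two words whenever any of their sub-word pairs are aligned:

    def subword_to_word_alignment(src_pieces, tgt_pieces, subword_align):
        """Convert Pharaoh-style sub-word alignments ("srcSub-tgtSub" pairs) to
        word-level alignments, assuming "▁" marks the start of each word."""
        def word_ids(pieces):
            ids, word = [], -1
            for piece in pieces:
                if piece.startswith("▁"):
                    word += 1
                ids.append(max(word, 0))
            return ids

        src_word, tgt_word = word_ids(src_pieces), word_ids(tgt_pieces)
        pairs = set()
        for pair in subword_align.split():
            s, t = map(int, pair.split("-"))
            pairs.add((src_word[s], tgt_word[t]))
        return " ".join("%d-%d" % p for p in sorted(pairs))

    print(subword_to_word_alignment(
        ["▁sub", "▁v", "f", "▁:"],    # 4 src sub-words -> 3 src words
        ["▁vo", "▁by", "▁:"],         # 3 tgt sub-words -> 3 tgt words
        "0-0 1-1 2-1 3-2"))           # sub-word pairs -> prints "0-0 1-1 2-2"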

Ah, thanks @francoishernandez