Best way to handle emojis during translation

Hi @Zenglinxiao,
I could filter out these edge cases on my end, but having a fix on OpenNMT-py’s side would benefit everyone else. Just wondering if you have had a chance to think about how to fix this blank prediction issue. Let me know if I can be of any help.

Thanks!

Hello @ArbinTimilsina,
A quick fix for this could be:

def extract_alignment(align_matrix, tgt_mask, src_lens, n_best):
...
        valid_tgt_len = valid_tgt.sum()
        if valid_tgt_len == 0:
            valid_alignment = None
        else:
            # get valid alignment (sub-matrix from the full padded alignment matrix)
            am_valid_tgt = am_b.masked_select(valid_tgt.unsqueeze(-1)) \
                               .view(valid_tgt_len, -1)
            valid_alignment = am_valid_tgt[:, :src_len]  # only keep valid src
        alignments[i // n_best].append(valid_alignment)
...

But I am not sure why setting batch_size to 1 does not trigger the blank prediction issue. In theory, when there is no valid tgt token, .view(0, -1) will always raise an error. Maybe there are simply no blank predictions when batch_size == 1?
Do you have any idea @francoishernandez?
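
To illustrate the .view(0, -1) point above: when the known dimensions multiply to zero, the size of the -1 dimension cannot be inferred, so the reshape is ambiguous. As far as I know, PyTorch raises a RuntimeError in that situation. Below is a minimal pure-Python sketch of that inference logic. Note that infer_view_shape is a hypothetical helper written only for illustration; it is not PyTorch's actual implementation.

```python
def infer_view_shape(numel, shape):
    """Hypothetical mimic of how a single -1 dimension gets resolved.

    numel: total number of elements in the tensor.
    shape: requested shape, with at most one -1 entry.
    """
    if list(shape).count(-1) > 1:
        raise RuntimeError("only one dimension can be inferred")

    # Product of all explicitly given dimensions.
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim

    if -1 not in shape:
        if known != numel:
            raise RuntimeError("shape is invalid for this number of elements")
        return tuple(shape)

    if known == 0:
        # Any size for the -1 dimension would give 0 elements: ambiguous.
        raise RuntimeError("cannot infer -1 when the other dims multiply to 0")
    if numel % known != 0:
        raise RuntimeError("shape is invalid for this number of elements")
    return tuple(numel // known if dim == -1 else dim for dim in shape)


print(infer_view_shape(12, (3, -1)))  # (3, 4)

try:
    infer_view_shape(0, (0, -1))  # the "empty target" case
except RuntimeError as err:
    print("error:", err)
```

This is why the quick fix checks valid_tgt_len == 0 before calling .view: the empty case has to be handled separately rather than reshaped.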

Could this be linked to #1320 and #1547?

@Zenglinxiao,
I tried your recommendation, but now I am getting:

File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/translate/translator.py", line 368, in translate
    in trans.word_aligns[:self.n_best]]
  File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/translate/translator.py", line 367, in <listcomp>
    align_pharaohs = [build_align_pharaoh(align) for align
  File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/utils/alignment.py", line 72, in build_align_pharaoh
    tgt_align_src_id = valid_alignment.argmax(dim=-1)
AttributeError: 'NoneType' object has no attribute 'argmax'

Any idea what would be appropriate for valid_alignment rather than None?

@ArbinTimilsina, thanks for the feedback.
I don’t think we can find a default tensor to use in place of None: in this case there really is no alignment, because the target is empty.
To solve the build_align_pharaoh problem, we just need to do the same thing:

import torch


def build_align_pharaoh(valid_alignment):
    align_pairs = []
    if isinstance(valid_alignment, torch.Tensor):
        tgt_align_src_id = valid_alignment.argmax(dim=-1)

        for tgt_id, src_id in enumerate(tgt_align_src_id.tolist()):
            align_pairs.append(str(src_id) + "-" + str(tgt_id))
        align_pairs.sort(key=lambda x: int(x.split('-')[-1]))  # sort by tgt_id
        align_pairs.sort(key=lambda x: int(x.split('-')[0]))  # sort by src_id
    return align_pairs

You can check my last PR to see if this helps.
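
As a quick illustration of the behaviour described above, here is a dependency-free sketch of the same logic that works on plain nested lists instead of tensors. The name build_align_pharaoh_from_rows and the list-of-rows input are assumptions made for this example only; the real function operates on a torch.Tensor as shown above.

```python
def build_align_pharaoh_from_rows(valid_alignment):
    """Illustrative stand-in for build_align_pharaoh.

    valid_alignment: list of rows, one per target token, each row holding
    attention weights over the source tokens; or None for an empty target.
    Returns "src-tgt" pairs in Pharaoh format.
    """
    align_pairs = []
    if valid_alignment is not None:
        for tgt_id, row in enumerate(valid_alignment):
            # argmax over the source positions for this target token
            src_id = max(range(len(row)), key=row.__getitem__)
            align_pairs.append(f"{src_id}-{tgt_id}")
        # Same final order as sorting by tgt_id, then (stably) by src_id.
        align_pairs.sort(key=lambda p: tuple(int(x) for x in p.split("-")))
    return align_pairs


rows = [
    [0.1, 0.9],  # target token 0 attends mostly to source token 1
    [0.8, 0.2],  # target token 1 attends mostly to source token 0
    [0.3, 0.7],  # target token 2 attends mostly to source token 1
]
print(build_align_pharaoh_from_rows(rows))  # ['0-1', '1-0', '1-2']
print(build_align_pharaoh_from_rows(None))  # []
```

With this guard, an empty prediction simply produces an empty alignment list instead of failing on None.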

Hi @Zenglinxiao,
I just checked out your branch and my test code ran without any errors, so I think it is safe to say that this PR has fixed the problem.

I will link this comment to the PR so that whoever is reviewing it will know about this.

Thanks for the help.

Regards,
Arbin

Hi @francoishernandez,
I see that this PR has been merged. Would it be possible to release a new version of OpenNMT-py (which includes this change) on pip?

Thanks!

I’d like to finish up a few things and then bump to 1.1.0. Probably by the end of the week.
In the meantime you can pip install git+https://github.com/OpenNMT/OpenNMT-py.

Ok, thanks.

@francoishernandez,
I just noticed that the version has been bumped to 1.1.0. I ran my test code with it and now I am getting the following error:

   batch_size=self.batch_size
  File "/home/atimilsina/daylight-venv/lib/python3.6/site-packages/onmt/translate/translator.py", line 319, in translate
    self.fields.pop('corpus_id')
KeyError: 'corpus_id'

Any idea why that is?

Thanks!

Yes, this ‘corpus_id’ field was added in #1732. This line should be made conditional to retain compatibility with previously trained models. I’ll release a fixed 1.1.1 version.
Thanks for reporting!
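
For anyone hitting the same KeyError, here is a minimal sketch of what "conditioning" such a line can look like. It uses plain dicts as stand-ins for the fields of an old and a new model, and pop_optional_field is a hypothetical helper; this is not necessarily how the 1.1.1 release implements the fix.

```python
def pop_optional_field(fields, name="corpus_id"):
    """Remove `name` from `fields` if present and return its value.

    Returns None when the key is missing, so models trained before the
    field existed do not raise KeyError.
    """
    return fields.pop(name, None)


# Stand-ins for the fields of a model trained before / after the new field.
old_model_fields = {"src": "src_field", "tgt": "tgt_field"}
new_model_fields = {"src": "src_field", "tgt": "tgt_field", "corpus_id": "corpus_field"}

print(pop_optional_field(old_model_fields))  # None
print(pop_optional_field(new_model_fields))  # corpus_field
print(sorted(new_model_fields))              # ['src', 'tgt']
```

An explicit `if "corpus_id" in fields:` check before the pop would work equally well; the default argument is just the shorter form.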

Thanks! It works now.

Hi @Zenglinxiao, @francoishernandez,
I came across the following error while running evaluation on one of my test datasets. I am running OpenNMT-py 1.1.1 and using a transformer model trained with guided alignment. For translation, I have replace_unk and report_align set to True. Any idea what this error is and how to fix it?

  File ".../onmt/translate/translation.py", line 51, in _build_target_tokens
    _, max_index = attn[i][:len(src_raw)].max(0)
IndexError: index 12 is out of bounds for dimension 0 with size 7

Thanks!

Hi @Zenglinxiao, @francoishernandez,
If it helps, I figured out that this error only happens during evaluation, when I pass the -tgt file to translator.translate(). If I don’t pass the -tgt file, I don’t come across this error.

Also, after some digging, I found that it happens, in my case, for the following source-target combination.

Source: ▁sub ▁v f ▁: ▁né mon e , ▁dr oo , ▁ve en , ▁ ¤ aka ¤
Target: ▁vo ▁by ▁: ▁ ¤ aka ¤

Any idea why this is causing the error and how to fix it?

Thanks,
Arbin

For reference, this was solved here.

Hi @Zenglinxiao, @francoishernandez,
I was curious if you have thought about or know of any tools/algorithms to convert the sub-word level alignment to word or phrase level alignment?

Regards,
Arbin

Section 5.1 of this paper is probably what you’re looking for: https://www.aclweb.org/anthology/D19-1453.pdf
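
If I read it correctly, the heuristic there is to treat a source word and a target word as aligned whenever any of their sub-word units are aligned. That is my paraphrase, so please check the paper for the exact wording. A rough sketch of that rule, assuming SentencePiece-style tokens where "▁" marks the start of a word and Pharaoh-style "src-tgt" pairs as input (the function names and the toy tokens are made up for this example):

```python
def subword_to_word_ids(tokens, marker="\u2581"):
    """Map each sub-word position to the index of the word it belongs to.

    Assumes a token starting with the marker ("▁") opens a new word.
    """
    word_ids = []
    current = -1
    for tok in tokens:
        if tok.startswith(marker) or current < 0:
            current += 1
        word_ids.append(current)
    return word_ids


def to_word_alignment(pairs, src_tokens, tgt_tokens):
    """Convert sub-word "src-tgt" pairs to word-level pairs.

    Two words are linked if any of their sub-words are linked.
    """
    src_words = subword_to_word_ids(src_tokens)
    tgt_words = subword_to_word_ids(tgt_tokens)
    links = set()
    for pair in pairs:
        src_id, tgt_id = (int(x) for x in pair.split("-"))
        links.add((src_words[src_id], tgt_words[tgt_id]))
    return [f"{s}-{t}" for s, t in sorted(links)]


src = ["▁né", "mon", "e", ",", "▁dr", "oo"]  # 2 words
tgt = ["▁ne", "mo", ",", "▁droo"]            # 2 words (toy target)
subword_pairs = ["0-0", "1-1", "2-1", "3-2", "4-3", "5-3"]
print(to_word_alignment(subword_pairs, src, tgt))  # ['0-0', '1-1']
```

Since the links are collected in a set, several sub-word links between the same two words collapse into a single word-level link.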

Ah, thanks @francoishernandez