Hi @Zenglinxiao,
I could filter out these edge cases on my end, but a fix on OpenNMT-py’s side would benefit everyone else. Have you had a chance to think about how to fix this blank prediction issue? Let me know if I can be of any help.
Thanks!
Hello @ArbinTimilsina,
A quick fix for this could be:
def extract_alignment(align_matrix, tgt_mask, src_lens, n_best):
    ...
    valid_tgt_len = valid_tgt.sum()
    if valid_tgt_len == 0:
        # no alignment when there is no valid tgt token
        valid_alignment = None
    else:
        # get valid alignment (sub-matrix from full padded alignment matrix)
        am_valid_tgt = am_b.masked_select(valid_tgt.unsqueeze(-1)) \
                           .view(valid_tgt_len, -1)
        valid_alignment = am_valid_tgt[:, :src_len]  # only keep valid src
    alignments[i // n_best].append(valid_alignment)
    ...
But I am not sure why setting batch_size to 1 does not trigger the blank prediction issue. In theory, when there is no valid tgt token, the .view(0, -1) call will always raise an error. Maybe there are simply no blank predictions when batch_size == 1?
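To make the .view(0, -1) point concrete, here is a small torch-free model of how a -1 dimension gets resolved. Note that infer_view_shape is a hypothetical helper written for illustration, not a PyTorch function; it only mimics the rule that the -1 entry is the element count divided by the product of the other dimensions, which is undefined when that product is 0.

```python
def infer_view_shape(numel, shape):
    """Resolve a single -1 entry in `shape` for `numel` elements.

    Hypothetical helper that models the rule only; the real check
    lives inside PyTorch.
    """
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim

    if -1 not in shape:
        if known != numel:
            raise ValueError("shape does not match the number of elements")
        return tuple(shape)

    if known == 0:
        # 0 * anything == 0, so every value of the -1 dimension would fit
        raise ValueError("cannot infer -1: the other dimensions multiply to 0")
    if numel % known != 0:
        raise ValueError("shape does not match the number of elements")

    return tuple(numel // known if dim == -1 else dim for dim in shape)


print(infer_view_shape(12, (3, -1)))  # a well-defined case
```

With an empty target, the call corresponds to infer_view_shape(0, (0, -1)), which has no unique answer, so guarding on valid_tgt_len == 0 before calling .view seems like the right place for the check.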
Do you have any idea @francoishernandez?
Could this be linked to #1320 and #1547?
@Zenglinxiao,
I tried your recommendation, but now I am getting:
File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/translate/translator.py", line 368, in translate
in trans.word_aligns[:self.n_best]]
File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/translate/translator.py", line 367, in <listcomp>
align_pharaohs = [build_align_pharaoh(align) for align
File "/home/atimilsina/Work/preprocessing-training-with-alignment/OpenNMT-py/onmt/utils/alignment.py", line 72, in build_align_pharaoh
tgt_align_src_id = valid_alignment.argmax(dim=-1)
AttributeError: 'NoneType' object has no attribute 'argmax'
Any idea what would be an appropriate value for valid_alignment rather than None?
@ArbinTimilsina, thanks for the feedback.
I don’t think we can find a default tensor to replace None: if the target is empty, there really is no alignment.
To solve the build_align_pharaoh problem, we just need to apply the same kind of guard there:
def build_align_pharaoh(valid_alignment):
    align_pairs = []
    if isinstance(valid_alignment, torch.Tensor):
        tgt_align_src_id = valid_alignment.argmax(dim=-1)
        for tgt_id, src_id in enumerate(tgt_align_src_id.tolist()):
            align_pairs.append(str(src_id) + "-" + str(tgt_id))
        align_pairs.sort(key=lambda x: int(x.split('-')[-1]))  # sort by tgt_id
        align_pairs.sort(key=lambda x: int(x.split('-')[0]))  # sort by src_id
    return align_pairs
You can check my last PR to see if this helps.
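If you want to check the behaviour without loading a model, the same guard can be sketched without torch. This is only an illustration under my own assumptions: build_align_pharaoh_rows is a hypothetical name, and the alignment is a plain list of rows (one row of source scores per target token) instead of a tensor.

```python
def build_align_pharaoh_rows(valid_alignment):
    """Return Pharaoh "src-tgt" pairs from a tgt x src score matrix.

    `valid_alignment` is a list of rows (hypothetical stand-in for the
    tensor), or None when the prediction is empty.
    """
    if valid_alignment is None:
        return []  # empty target: nothing to align

    pairs = []
    for tgt_id, row in enumerate(valid_alignment):
        # argmax over source positions for this target token
        src_id = max(range(len(row)), key=row.__getitem__)
        pairs.append((src_id, tgt_id))

    pairs.sort()  # order by src_id, then tgt_id
    return ["%d-%d" % pair for pair in pairs]


scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
print(build_align_pharaoh_rows(scores))
print(build_align_pharaoh_rows(None))
```

Returning an empty list for a blank prediction keeps the output aligned line by line with the translations, which is the point of the guard.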
Hi @Zenglinxiao,
I just checked out your branch and my test code ran without any errors, so I think it is safe to say that this PR fixes the problem.
I will link this comment to the PR so that whoever is reviewing it will know about this.
Thanks for the help.
Regards,
Arbin
Hi @francoishernandez,
I see that this PR has been merged. Would it be possible to release a new version of OpenNMT-py (including this change) on pip?
Thanks!
I’d like to finish up a few things and then bump to 1.1.0. Probably by the end of the week.
In the meantime, you can run: pip install git+https://github.com/OpenNMT/OpenNMT-py
@francoishernandez,
I just noticed that the version has been bumped to 1.1.0. I ran my test code with it and now I am getting the following error:
batch_size=self.batch_size
File "/home/atimilsina/daylight-venv/lib/python3.6/site-packages/onmt/translate/translator.py", line 319, in translate
self.fields.pop('corpus_id')
KeyError: 'corpus_id'
Any idea why that is?
Thanks!
Yes, the ‘corpus_id’ field was added in #1732. This line should be made conditional to stay compatible with previously trained models. I’ll release a fixed 1.1.1 version.
Thanks for reporting!
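As a rough idea of what "conditional" means here, the sketch below uses plain dicts in place of the real fields object. The helper name drop_optional_field is hypothetical, and this is not necessarily how the 1.1.1 fix is written.

```python
def drop_optional_field(fields, name="corpus_id"):
    """Remove `name` from `fields` if present and return its value.

    Returns None, without raising, when the key is missing (for example
    with a model trained before the field existed).
    """
    return fields.pop(name, None)


# Stand-ins for the fields of an older and a newer model (assumed layout).
old_model_fields = {"src": "src_field", "tgt": "tgt_field"}
new_model_fields = {"src": "src_field", "tgt": "tgt_field", "corpus_id": "id_field"}

drop_optional_field(old_model_fields)  # no KeyError
drop_optional_field(new_model_fields)  # key removed
print(sorted(old_model_fields), sorted(new_model_fields))
```

An explicit `if "corpus_id" in fields:` check before the pop would work just as well.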
Hi @Zenglinxiao, @francoishernandez,
I came across the following error while running evaluation on one of my test datasets. I am running OpenNMT-py 1.1.1 and using a transformer model trained with guided alignment. For translation, I have replace_unk and report_align set to True. Any idea what this error is and how to fix it?
File ".../onmt/translate/translation.py", line 51, in _build_target_tokens
_, max_index = attn[i][:len(src_raw)].max(0)
IndexError: index 12 is out of bounds for dimension 0 with size 7
Thanks!
Hi @Zenglinxiao, @francoishernandez,
If it helps, I figured out that this error happens only during evaluation, when I pass the -tgt file to translator.translate(). If I don’t pass the -tgt file, the error does not occur.
Also, after some digging, I found that it happens, in my case, for the following source-target combination.
Source: ▁sub ▁v f ▁: ▁né mon e , ▁dr oo , ▁ve en , ▁ ¤ aka ¤
Target: ▁vo ▁by ▁: ▁ ¤ aka ¤
Any idea why this is causing the error and how to fix it?
Thanks,
Arbin
For reference this was solved here.
Hi @Zenglinxiao, @francoishernandez,
I was wondering whether you have thought about, or know of, any tools or algorithms for converting sub-word-level alignments to word- or phrase-level alignments.
Regards,
Arbin
Section 5.1 of this paper is probably what you’re looking for: https://www.aclweb.org/anthology/D19-1453.pdf
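As a starting point, here is a minimal sketch of one common heuristic: two words are considered aligned if any of their sub-words are aligned. Please check the paper for its exact procedure; the function names below are hypothetical, and the code assumes SentencePiece-style tokens where a leading "▁" marks the start of a new word, and 0-indexed "src-tgt" Pharaoh pairs.

```python
def subword_to_word_ids(tokens, marker="\u2581"):
    """Map each sub-word position to the index of the word it belongs to.

    A token starting with `marker` (the SentencePiece "▁") opens a new word.
    """
    word_ids = []
    current = -1
    for tok in tokens:
        if tok.startswith(marker) or current < 0:
            current += 1
        word_ids.append(current)
    return word_ids


def to_word_alignment(pharaoh_pairs, src_tokens, tgt_tokens):
    """Collapse sub-word "src-tgt" pairs into word-level pairs."""
    src_words = subword_to_word_ids(src_tokens)
    tgt_words = subword_to_word_ids(tgt_tokens)
    word_pairs = set()
    for pair in pharaoh_pairs:
        src_id, tgt_id = pair.split("-")
        word_pairs.add((src_words[int(src_id)], tgt_words[int(tgt_id)]))
    return ["%d-%d" % p for p in sorted(word_pairs)]


src = ["▁ve", "en", "▁:"]    # two words: "veen" and ":"
tgt = ["▁vo", "▁by", "▁:"]   # three words
print(to_word_alignment(["0-0", "1-1", "2-2"], src, tgt))
```

Because the pairs are collected in a set, several sub-word links between the same two words collapse into a single word-level link.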