UNK replacement

< Inbound >< References > < Ref1 > AA212 < /Ref1 > < Ref2 / > < Ref3 / > < Ref4 / > < Ref5 / > < Ref6 / > < MessageId > 12345678 < /MessageId > < TransactionId > 23456789 < /TransactionId >…up to 359 “words”

It looks “good” without -replace_unk:
< Results > < Ref1 > GC08 < /Ref1 > < Ref2 / > < Ref3 / > < Ref4 / > < Ref5 / > < Ref6 / > < MessageId > < unk > < /MessageId > < TransactionId > < unk > < /TransactionId >…up to 409 “words”

…but with replace_unk it takes the word pair < Inbound >< References > that is in the src dictionary and happily replaces all < unk > tokens with that:
< Results > < Ref1 > AA212 < /Ref1 > < Ref2 / > < Ref3 / > < Ref4 / > < Ref5 / > < Ref6 / > < MessageId > < Inbound >< References > < /MessageId > < TransactionId > < Inbound >< References > < /TransactionId >…up to 409 “words”

Is this expected behavior or am I doing something wrong?

The model was created with the command: th train.lua -data /home/ari/master-data/vocab120-163-train.t7 -save_model cv/1400x6 -epochs 100 -rnn_size 1400 -word_vec_size 1400 -layers 6 -gpuid 3 -report_every 5 -max_batch_size 10 -learning_rate 0.3

Model details are:
Loading data from ‘/home/ari/master-data/vocab120-163-train.t7’…

  • vocabulary size: source = 124; target = 167
  • additional features: source = 0; target = 0
  • maximum sequence length: source = 358; target = 409
  • number of training sentences: 5154
  • maximum batch size: 10
    Building model…
  • using input feeding
    Initializing parameters…
  • number of parameters: 202588567

EDIT: had to add lots of space for XML to display

-replace_unk blindly replaces each unknown token with the source token that has the highest attention score at that decoding step. See the guide:

http://opennmt.net/Guide/#translating-unk-words
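To make the mechanism concrete, here is a toy sketch of what the guide describes (not OpenNMT’s actual code; the function name and data are made up): each predicted `<unk>` is swapped for the source token with the highest attention weight at that decoding step.

```python
def replace_unk(pred_tokens, src_tokens, attn, unk="<unk>"):
    """Replace each <unk> in the prediction with the source token
    that received the highest attention weight at that step.
    attn[i][j] = attention weight of target step i on source token j."""
    out = []
    for i, tok in enumerate(pred_tokens):
        if tok == unk:
            row = attn[i]
            j = max(range(len(src_tokens)), key=lambda k: row[k])
            out.append(src_tokens[j])
        else:
            out.append(tok)
    return out

src = ["Le", "chat", "dort"]
pred = ["The", "<unk>", "sleeps"]
attn = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],  # step 2 attends mostly to "chat"
    [0.1, 0.1, 0.8],
]
print(replace_unk(pred, src, attn))  # ['The', 'chat', 'sleeps']
```

Note there is no check that the copied source token was itself unknown: if the attention row is diffuse or mislearned, the argmax just lands on whatever source position scored highest, which is exactly the behavior reported above.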

Hmm. As such it seems quite useless then. At a minimum it should take the UNK tokens from the source and map them (using attention) to UNK tokens in the target. In this case it simply seems to take the first token it finds, which is not even UNK, and maps that. Are you sure this is how it is supposed to work?

As described in the guide, it is useful for text translation:

Often times UNK symbols will correspond to proper names that can be directly transposed between languages.

In your case, it takes the first token because you don’t have enough data for the attention layer to learn correctly. With large-scale training, attention can be seen as alignment, so it will copy the source word that was supposed to be translated.

I know this is no longer state of the art, but for pedagogical purposes I wanted to build a basic RNN MT model and then gradually improve it. The first improvement would be to use the -replace_unk switch with onmt_translate. Is it possible that this is broken in version 3? I get
Traceback (most recent call last):
  File "/opt/conda/bin/onmt_translate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/onmt/bin/translate.py", line 60, in main
    translate(opt)
  File "/opt/conda/lib/python3.10/site-packages/onmt/bin/translate.py", line 41, in translate
    _, _ = translator._translate(
  File "/opt/conda/lib/python3.10/site-packages/onmt/translate/translator.py", line 348, in _translate
    translations = xlation_builder.from_batch(batch_data)
  File "/opt/conda/lib/python3.10/site-packages/onmt/translate/translation.py", line 92, in from_batch
    pred_sents = [self._build_target_tokens(
  File "/opt/conda/lib/python3.10/site-packages/onmt/translate/translation.py", line 92, in <listcomp>
    pred_sents = [self._build_target_tokens(
  File "/opt/conda/lib/python3.10/site-packages/onmt/translate/translation.py", line 52, in _build_target_tokens
    _, max_index = attn[i][:len(src_raw)].max(0)
TypeError: object of type 'NoneType' has no len()
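For what it’s worth, the final frame shows exactly why it crashes: `_build_target_tokens` slices the attention row with `len(src_raw)`, and `src_raw` is apparently `None` in this code path. A minimal reproduction of just that failure mode (the variable names are stand-ins):

```python
# Minimal reproduction of the failing line from the traceback:
#     _, max_index = attn[i][:len(src_raw)].max(0)
# The slice itself is harmless; calling len() on src_raw is what raises.
attn_row = [0.2, 0.5, 0.3]  # stand-in for one attention row
src_raw = None              # what _build_target_tokens apparently received

try:
    _ = attn_row[:len(src_raw)]
except TypeError as e:
    print(e)  # object of type 'NoneType' has no len()
```

So the question is really why the translator never populated the raw source tokens (or attention) for the builder in this v3 setup.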