UNK replacement


(Ari Juntunen) #1

< Inbound >< References > < Ref1 > AA212 < /Ref1 > < Ref2 / > < Ref3 / > < Ref4 / > < Ref5 / > < Ref6 / > < MessageId > 12345678 < /MessageId > < TransactionId > 23456789 < /TransactionId >…up to 359 “words”

It looks “good” without -replace_unk:
< Results > < Ref1 > GC08 < /Ref1 > < Ref2 / > < Ref3 / > < Ref4 / > < Ref5 / > < Ref6 / > < MessageId > < unk > < /MessageId > < TransactionId > < unk > < /TransactionId >…up to 409 “words”

…but with -replace_unk it takes the word pair < Inbound >< References >, which is in the source dictionary, and happily replaces every < unk > token with it:
< Results > < Ref1 > AA212 < /Ref1 > < Ref2 / > < Ref3 / > < Ref4 / > < Ref5 / > < Ref6 / > < MessageId > < Inbound >< References > < /MessageId > < TransactionId > < Inbound >< References > < /TransactionId > …up to 409 “words”

Is this expected behavior or am I doing something wrong?

The model was created with this command: th train.lua -data /home/ari/master-data/vocab120-163-train.t7 -save_model cv/1400x6 -epochs 100 -rnn_size 1400 -word_vec_size 1400 -layers 6 -gpuid 3 -report_every 5 -max_batch_size 10 -learning_rate 0.3

Model details are:
Loading data from ‘/home/ari/master-data/vocab120-163-train.t7’…

  • vocabulary size: source = 124; target = 167
  • additional features: source = 0; target = 0
  • maximum sequence length: source = 358; target = 409
  • number of training sentences: 5154
  • maximum batch size: 10
Building model…
  • using input feeding
Initializing parameters…
  • number of parameters: 202588567

EDIT: had to add lots of space for XML to display


(Guillaume Klein) #2

-replace_unk blindly replaces each unknown token in the output with the source token that received the highest attention score at that decoding step. See the guide:

http://opennmt.net/Guide/#translating-unk-words
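
For illustration, here is a minimal sketch of that logic in Python (this is a reading of the behavior, not OpenNMT's actual Lua implementation; the function and variable names are hypothetical):

    # Sketch of what -replace_unk does: for every decoding step that emitted
    # <unk>, copy the source token with the highest attention weight there.
    def replace_unk(tgt_tokens, src_tokens, attn):
        # attn[t][j] = attention weight on source position j at target step t
        out = []
        for t, tok in enumerate(tgt_tokens):
            if tok == "<unk>":
                best = max(range(len(src_tokens)), key=lambda j: attn[t][j])
                out.append(src_tokens[best])  # copied verbatim, whatever it is
            else:
                out.append(tok)
        return out

Note that nothing checks whether the copied source token was itself unknown; with a poorly trained attention layer the argmax can land on the same irrelevant token at every step.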


(Ari Juntunen) #3

Hmm. As such it seems quite useless, then. At a minimum it should take the UNK tokens from the source and map them (using attention) to the UNK tokens in the target. In this case it simply seems to take the first token it finds, which is not even UNK, and maps that everywhere. Are you sure this is how it is supposed to work? Something along the lines of the sketch below would make more sense.
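
(A hypothetical sketch only, with src_oov_mask marking which source positions were out-of-vocabulary on the encoder side — this is not an existing OpenNMT option:)

    # Proposed variant: only copy from source positions that were
    # themselves out-of-vocabulary (i.e. also <unk> to the encoder).
    def replace_unk_only_oov(tgt_tokens, src_tokens, src_oov_mask, attn):
        out = []
        for t, tok in enumerate(tgt_tokens):
            if tok == "<unk>":
                oov = [j for j in range(len(src_tokens)) if src_oov_mask[j]]
                if oov:
                    best = max(oov, key=lambda j: attn[t][j])
                    out.append(src_tokens[best])
                    continue
            out.append(tok)
        return out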


(Guillaume Klein) #4

As described in the guide, it is useful for text translation:

Often times UNK symbols will correspond to proper names that can be directly transposed between languages.

In your case, it takes the first token because you don’t have enough data for the attention layer to learn correctly. With large-scale training, the attention weights can be seen as alignments, so it will copy the source word that was supposed to be translated.
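
For example, with a well-trained attention vector the argmax at the <unk> step points at the word to copy (toy numbers, invented for illustration):

    # Toy illustration: the decoder emitted <unk> while attending mostly
    # to "AA212", so -replace_unk would copy "AA212" into the output.
    src = ["<Ref1>", "AA212", "</Ref1>"]
    attn_at_unk_step = [0.05, 0.90, 0.05]
    best = max(range(len(src)), key=lambda j: attn_at_unk_step[j])
    print(src[best])  # -> AA212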