Why is <UNK> in GOLD?

Why am I getting UNK in the GOLD line when running onmt-py translate?

SENT 2550: ['But', 'the', 'latest', 'simulation', 'suggests', 'Venus', 'could', 'have', 'boasted', 'a', 'thin', '■,', 'Earth', '■-■', 'like', 'atmosphere', 'and', 'still', 'spun', 'slowly', '■.']
PRED 2550: Aber die jüngste Simulation deutet darauf hin ■, dass Venus eine dünne ■, Earth Atmosphäre boasted und sich immer noch langsam dreht ■.
PRED SCORE: -14.8150
GOLD 2550: Aber die neueste Simulation deutet darauf hin ■, dass die Venus eine dünne ■, unk Atmosphäre gehabt haben könnte und sich dennoch langsam unk ■.

Q1. My pred even tries to predict it! But how can BLEU possibly compare it if the GOLD is UNK?
Q2. And, please, where is the help file on using translate (with details like this explained)?

Q1. I reckon not all of your target tokens are in your vocab. When the gold sentence is rebuilt for display, any out-of-vocabulary (OOV) word is mapped to the unknown token.
You can verify that by looking at your vocab.pt as mentioned here.
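To make that check concrete, here is a minimal sketch. It is not OpenNMT-py code: `find_oov` and `rebuild_with_unk` are hypothetical helpers, the vocabulary is a toy set, and the gold tokens are invented for illustration. It assumes you have already extracted your target-side tokens from your vocab file into a plain Python set (how to do that depends on your version).

```python
# Hypothetical helpers for illustration only; not part of OpenNMT-py.

def find_oov(tokens, vocab):
    """Return the tokens that are missing from the vocabulary, in order."""
    return [tok for tok in tokens if tok not in vocab]


def rebuild_with_unk(tokens, vocab, unk="<unk>"):
    """Mimic what happens on display: OOV tokens become the unknown token."""
    return [tok if tok in vocab else unk for tok in tokens]


# Toy target vocabulary (assumption): in practice, fill this set from
# the tokens stored in your own vocab file.
target_vocab = {"eine", "dünne", "Atmosphäre"}

# Invented gold tokens; "erdähnliche" is deliberately left out of the vocab.
gold_tokens = ["eine", "dünne", "erdähnliche", "Atmosphäre"]

print(find_oov(gold_tokens, target_vocab))
print(" ".join(rebuild_with_unk(gold_tokens, target_vocab)))
```

Any token reported by `find_oov` is one that would show up as the unknown token in a rebuilt gold sentence.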

Q2. I’m not sure there is any additional doc about all the translation options other than this. If you have specific questions, your best bet is to search the forum and the GitHub issues.


Right. OK, and it’s because I’m experimenting with NON-sub-word tokenisation and I’m hitting the maximum vocab size setting.
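For anyone landing here later, a rough sketch of why a size cap bites with word-level tokens. This is plain Python, not how OpenNMT-py builds its vocab: `build_capped_vocab` and `oov_rate` are hypothetical names, the corpus is a toy one, and `max_size` simply stands in for whatever maximum vocab size you have configured.

```python
from collections import Counter


def build_capped_vocab(sentences, max_size):
    """Keep only the max_size most frequent whitespace-separated tokens."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(max_size)}


def oov_rate(sentences, vocab):
    """Fraction of running tokens that fall outside the vocabulary."""
    tokens = [tok for sent in sentences for tok in sent.split()]
    if not tokens:
        return 0.0
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Toy corpus (assumption): five distinct word types, ten running tokens.
corpus = ["die Venus dreht sich", "die Erde dreht sich", "die Venus"]

capped = build_capped_vocab(corpus, max_size=4)  # the rarest word is dropped
full = build_capped_vocab(corpus, max_size=5)    # every word fits

print(oov_rate(corpus, capped))
print(oov_rate(corpus, full))
```

With whole words, every rare word beyond the cap ends up as the unknown token, whereas sub-word tokenisation splits rare words into pieces that are much more likely to be in the vocab.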

That explains everything.

Thanks FH.