Hi! Apologies if I should have opened this as a new post, but it is probably fine here. From what I have seen in the forum, both GOLD SCORE and PRED SCORE appear to be the accumulated log-likelihood (hence negative) of the sentence being scored. Say the system predicts a sentence T that is exactly the same as the GOLD reference: I would expect GOLD SCORE = PRED SCORE, but this is not what happens. I ran the following command:
onmt_translate --min_length 2 -model files/run/model_step_100000.pt --verbose \
-log_file files/UNv1.0.testset.fr.tok.tc.sp.10.2en.GOLD.log \
-src files/UNv1.0.testset.fr.tok.tc.sp.10 \
-tgt files/UNv1.0.testset.en.tok.tc.sp.10 \
-output files/UNv1.0.testset.fr.tok.tc.sp.10.2en
My model is a Transformer and gives good results, but the log file shows, for instance:
SENT 1: ['<s>', '▁74', '39', 'e', '▁s', 'é', 'ance', '▁@@,', '▁ten', 'ue', '▁le', '▁11', '▁ma', 'i', '▁2015', '</s>']
PRED 1: <s> ▁74 39 th ▁meeting ▁@@, ▁held ▁on ▁11 ▁ ⦅up⦆ ▁may ▁2015 ▁@@.
PRED SCORE: -0.0882
GOLD 1: <s> ▁74 39 th ▁meeting ▁@@, ▁held ▁on ▁11 ▁ ⦅up⦆ ▁may ▁2015 ▁@@.
GOLD SCORE: -1.4114
[2023-12-28 14:07:09,694 INFO]
SENT 2: ['<s>', '▁', '⦅up⦆', '▁l', "▁@@'@@", '▁', '⦅aup⦆', '▁e', 'i', 'il', '▁a', '▁mis', '▁en', '▁l', 'ign', 'e', '▁des', '▁vid', 'é', 'os', '▁d', 'ans', '▁les', 'qu', 'ell', 'es', '▁on', '▁pe', 'ut', '▁vo', 'ir', '▁des', '▁person', 'nes', '▁sub', 'ir', '▁to', 'ute', '▁une', '▁s', 'é', 'rie', '▁de', '▁ch', 'â', 'ti', 'ments', '▁ab', 'omin', 'ables', '▁:', '▁certain', 'es', '▁é', 'ta', 'ient', '▁lap', 'id', 'é', 'es', '▁@@,', '▁d', "▁@@'@@", '▁aut', 'res', '▁pr', 'é', 'c', 'ip', 'ité', 'es', '▁au', '▁sol', '▁depu', 'is', '▁le', '▁to', 'it', '▁d', "▁@@'@@", '▁un', '▁im', 'me', 'uble', '▁ou', '▁enc', 'ore', '▁dé', 'c', 'ap', 'ité', 'es', '▁ou', '▁cruc', 'if', 'i', 'é', 'es', '▁@@.', '</s>']
PRED 2: <s> ▁ ⦅aup⦆ ▁is il ▁has ▁posted ▁vide os ▁in ▁which ▁people ▁can ▁be ▁seen ▁to ▁be ▁subjected ▁to ▁a ▁wide ▁range ▁of ▁ab h or rent ▁punishments ▁@@: ▁some ▁were ▁st oned ▁@@, ▁others ▁were ▁precip itated ▁on ▁the ▁ground ▁from ▁the ▁roof ▁of ▁a ▁building ▁or ▁were ▁dec ap ed ▁or ▁cruc ified ▁@@.
PRED SCORE: -0.4513
GOLD 2: <s> ▁ ⦅aup⦆ ▁is il ▁itself ▁has ▁published ▁vide os ▁dep ic ting ▁people ▁being ▁subjected ▁to ▁a ▁range ▁of ▁ab h or rent ▁punishments ▁@@, ▁including ▁st oning ▁@@, ▁being ▁pushed ▁@@-@@ ▁off ▁buildings ▁@@, ▁dec ap itation ▁and ▁cruc if ix ion ▁@@.
GOLD SCORE: -86.4680
Notice that the prediction for sentence 1 is identical to GOLD 1. Shouldn't the two scores be the same?
Notice that the prediction for sentence 2 is quite good, so I am not sure why the scores differ so much.
I have seen other logs in the forum, and the GOLD score follows the same trend (much, much lower than PRED SCORE).
If I use the model's own output as the gold file (so PRED sentences = GOLD sentences), no sentence gets the same score; again the GOLD values are different and much, much lower.
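One rough check I can do on the numbers above: if one score were a per-token average and the other a plain sum over tokens, the two would differ by roughly a sentence-length factor. This is only my assumption, I have not confirmed it in the OpenNMT code. A minimal Python sketch for sentence 1, using only the values printed in the log (the variable names are my own):

```python
# Scores copied from the log for sentence 1, where PRED and GOLD text are identical.
pred_score = -0.0882
gold_score = -1.4114

# Tokens exactly as printed on the PRED 1 line (the end-of-sentence symbol is not printed).
pred_line = "<s> ▁74 39 th ▁meeting ▁@@, ▁held ▁on ▁11 ▁ ⦅up⦆ ▁may ▁2015 ▁@@."
pred_tokens = pred_line.split()

# ASSUMPTION (unverified): one score is a sum of token log-probabilities and
# the other is that sum divided by a length. If so, the ratio is that length.
ratio = gold_score / pred_score

print(f"printed tokens:    {len(pred_tokens)}")
print(f"gold / pred ratio: {ratio:.2f}")
```

For sentence 1 the ratio lands very close to a whole number that is in the neighborhood of the printed token count plus the end-of-sentence symbol, so a length-normalization difference looks plausible, but I cannot tell from the log alone.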
So, what am I missing here?
Thanks in advance!
M