What exactly is the 'Gold Score'?

I understand that the target sentence is passed through the decoder part of the model, and that the decoder output is then passed through the generator (linear layer + softmax function) to obtain log probabilities, and that the gold score is related to these probabilities. But what exactly does it signify? Should the gold score be better (lower) than the pred score?

Thanks.

I don't think there is an official description of the gold score. Normally, the lower the gold score we obtain, the better the model fits (that is, the closer the distribution the model learns is to the target-side language distribution). And if your model has converged, the gold score should technically be lower than the pred score.

Gold score is the log likelihood of the reference that you provided during translation.
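
As a rough illustration of that definition, the sketch below scores a reference by summing the log-probabilities the model assigns to each reference token when the decoder is fed the reference itself (teacher forcing). The function names and the toy logits are hypothetical; this is a minimal sketch of the idea, not OpenNMT's actual implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def gold_score(step_logits, reference_ids):
    """Sum the log-probabilities assigned to each reference token.

    step_logits[t] is assumed to hold the generator logits at step t
    when the decoder is fed the reference prefix (hypothetical input).
    """
    if len(step_logits) != len(reference_ids):
        raise ValueError("need one logit vector per reference token")
    return sum(
        log_softmax(logits)[tok]
        for logits, tok in zip(step_logits, reference_ids)
    )


# Toy example: a 3-token vocabulary and a 2-token reference.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
toy_reference = [0, 2]
score = gold_score(toy_logits, toy_reference)
print(f"gold score (summed log-likelihood): {score:.4f}")
```

Because every term is a log-probability, the sum is never positive, and a value closer to zero means the model assigns more probability to the reference.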

@alphadl @guillaumekln, thanks!

Hi! Apologies if I should have opened this as a new post, but it is probably fine here. From what I have seen in the forum, both GOLD SCORE and PRED SCORE appear to be the negative cumulative log likelihood of the generated sentence. Let's say the system predicts a sentence T that is exactly the same as the GOLD sentence: I would expect GOLD SCORE = PRED SCORE, but this is not what happens. I ran the following command:

onmt_translate --min_length 2 -model files/run/model_step_100000.pt --verbose \
        -log_file files/UNv1.0.testset.fr.tok.tc.sp.10.2en.GOLD.log \
        -src      files/UNv1.0.testset.fr.tok.tc.sp.10 \
        -tgt      files/UNv1.0.testset.en.tok.tc.sp.10 \
        -output   files/UNv1.0.testset.fr.tok.tc.sp.10.2en

My model is a Transformer and gets good results. But the log file shows, for instance:

SENT 1: ['<s>', '▁74', '39', 'e', '▁s', 'é', 'ance', '▁@@,', '▁ten', 'ue', '▁le', '▁11', '▁ma', 'i', '▁2015', '</s>']
PRED 1: <s> ▁74 39 th ▁meeting ▁@@, ▁held ▁on ▁11 ▁ ⦅up⦆ ▁may ▁2015 ▁@@.
PRED SCORE: -0.0882
GOLD 1: <s> ▁74 39 th ▁meeting ▁@@, ▁held ▁on ▁11 ▁ ⦅up⦆ ▁may ▁2015 ▁@@.
GOLD SCORE: -1.4114

[2023-12-28 14:07:09,694 INFO] 
SENT 2: ['<s>', '▁', '⦅up⦆', '▁l', "▁@@'@@", '▁', '⦅aup⦆', '▁e', 'i', 'il', '▁a', '▁mis', '▁en', '▁l', 'ign', 'e', '▁des', '▁vid', 'é', 'os', '▁d', 'ans', '▁les', 'qu', 'ell', 'es', '▁on', '▁pe', 'ut', '▁vo', 'ir', '▁des', '▁person', 'nes', '▁sub', 'ir', '▁to', 'ute', '▁une', '▁s', 'é', 'rie', '▁de', '▁ch', 'â', 'ti', 'ments', '▁ab', 'omin', 'ables', '▁:', '▁certain', 'es', '▁é', 'ta', 'ient', '▁lap', 'id', 'é', 'es', '▁@@,', '▁d', "▁@@'@@", '▁aut', 'res', '▁pr', 'é', 'c', 'ip', 'ité', 'es', '▁au', '▁sol', '▁depu', 'is', '▁le', '▁to', 'it', '▁d', "▁@@'@@", '▁un', '▁im', 'me', 'uble', '▁ou', '▁enc', 'ore', '▁dé', 'c', 'ap', 'ité', 'es', '▁ou', '▁cruc', 'if', 'i', 'é', 'es', '▁@@.', '</s>']
PRED 2: <s> ▁ ⦅aup⦆ ▁is il ▁has ▁posted ▁vide os ▁in ▁which ▁people ▁can ▁be ▁seen ▁to ▁be ▁subjected ▁to ▁a ▁wide ▁range ▁of ▁ab h or rent ▁punishments ▁@@: ▁some ▁were ▁st oned ▁@@, ▁others ▁were ▁precip itated ▁on ▁the ▁ground ▁from ▁the ▁roof ▁of ▁a ▁building ▁or ▁were ▁dec ap ed ▁or ▁cruc ified ▁@@.
PRED SCORE: -0.4513
GOLD 2: <s> ▁ ⦅aup⦆ ▁is il ▁itself ▁has ▁published ▁vide os ▁dep ic ting ▁people ▁being ▁subjected ▁to ▁a ▁range ▁of ▁ab h or rent ▁punishments ▁@@, ▁including ▁st oning ▁@@, ▁being ▁pushed ▁@@-@@ ▁off ▁buildings ▁@@, ▁dec ap itation ▁and ▁cruc if ix ion ▁@@.
GOLD SCORE: -86.4680

Notice that the prediction for SENT 1 is identical to GOLD 1. Shouldn't the two scores be the same?
Notice also that the prediction for SENT 2 is quite good, so I am not sure why the scores differ so much.

I have seen other logs in the forum, and the GOLD score shows the same trend (much, much lower than PRED SCORE).

If I pass the model's own output as the gold file (so that the PRED sentences equal the GOLD sentences), no sentence gets the same score: the values again differ, and GOLD is again much, much lower.

So, what am I missing here?
Thanks in advance!
M
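
One way to probe the discrepancy is to pull the PRED SCORE and GOLD SCORE pairs out of the log and compare them sentence by sentence. The sketch below assumes the log format shown above; the helper name `score_pairs` is hypothetical. For a sentence such as SENT 1, where PRED and GOLD are the same token sequence, a ratio far from 1 would suggest that the two scores are not computed or normalized in the same way (for example, one summed over tokens and the other averaged per token). That is a hypothesis to check, not something confirmed in this thread.

```python
import re

# Matches lines such as "PRED SCORE: -0.0882" or "GOLD SCORE: -1.4114".
SCORE_RE = re.compile(r"^(PRED|GOLD) SCORE:\s*(-?\d+(?:\.\d+)?)$")


def score_pairs(log_text):
    """Return a list of (pred_score, gold_score) tuples, one per sentence."""
    pairs, pred = [], None
    for line in log_text.splitlines():
        match = SCORE_RE.match(line.strip())
        if not match:
            continue
        kind, value = match.group(1), float(match.group(2))
        if kind == "PRED":
            pred = value
        elif pred is not None:
            # A GOLD line closes the pair opened by the preceding PRED line.
            pairs.append((pred, value))
            pred = None
    return pairs


# Scores copied from the log excerpt in this post.
sample_log = """\
PRED SCORE: -0.0882
GOLD SCORE: -1.4114
PRED SCORE: -0.4513
GOLD SCORE: -86.4680
"""

for pred, gold in score_pairs(sample_log):
    ratio = gold / pred if pred else float("nan")
    print(f"pred={pred:.4f} gold={gold:.4f} gold/pred={ratio:.2f}")
```

Only sentences whose PRED and GOLD token sequences match are directly comparable this way; for SENT 2 the two sequences differ, so a large gap there is less surprising.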
