Hello,
I have seen discussions about this for OpenNMT-py, but could not find anything for OpenNMT-tf. I built a baseline model, fine-tuned it, and ran evaluation on the test dataset. I then realized I had only exported the CTranslate2 model and forgot to save the TF model, so I had to repeat the exact same experiment. That is when I noticed that the translation results, both on the dev dataset during validation and on the test dataset, differ slightly (by 0.4+ BLEU) from the previous experiment. I ran it a third time, and the results were different again. Note that everything else is identical: the validation was compared at the same steps, and the CTranslate2 model was exported from the same checkpoint and run on the same hardware.
Please also note that I use Relative Position Representations in all the experiments. So I wonder whether there is a missing seed value somewhere, or whether this is normal for some other reason.
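For what it's worth, the only seeding I know to set myself is the standard Python/NumPy/TensorFlow seeds, roughly as in the sketch below (the value 42 is just a placeholder). Is there anything OpenNMT-tf-specific that should be set on top of this?

```python
import random

import numpy as np
import tensorflow as tf

# Fix the seeds of the random number generators I am aware of,
# before building and training the model.
SEED = 42  # placeholder value
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# As far as I understand, some GPU ops can still be non-deterministic
# even with these seeds fixed, which might also explain part of the variation.
```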
The main reason I am asking is that I want to calculate the statistical significance of the results. However, if the results change by 0.4+ BLEU just from re-running the same experiment, I guess a significance test will be meaningless whenever the improvement between experiments is small. The good news is that the improvement I have is much larger than this, but I would still like to understand this point, if possible.
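For context, the kind of test I have in mind is paired bootstrap resampling over the test set, roughly like the sketch below (using sacreBLEU; the file names are placeholders, not my actual files):

```python
import random

import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Placeholder file names: detokenized system outputs and the reference.
baseline = read_lines("baseline.txt")
finetuned = read_lines("finetuned.txt")
reference = read_lines("reference.txt")

n = len(reference)
num_samples = 1000
wins = 0
for _ in range(num_samples):
    # Resample the test set with replacement and score both systems
    # on the same resampled subset.
    idx = [random.randrange(n) for _ in range(n)]
    refs = [[reference[i] for i in idx]]
    bleu_base = sacrebleu.corpus_bleu([baseline[i] for i in idx], refs).score
    bleu_fine = sacrebleu.corpus_bleu([finetuned[i] for i in idx], refs).score
    if bleu_fine > bleu_base:
        wins += 1

# Fraction of samples where the fine-tuned model did NOT win, as a rough p-value.
p_value = 1.0 - wins / num_samples
print(f"fine-tuned wins in {wins}/{num_samples} samples, p ≈ {p_value:.3f}")
```

My worry is that this only captures the variation from sampling the test set, not the 0.4+ BLEU variation between identical training runs.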
I would highly appreciate any clarification on this matter. I will be happy to share more details if needed. Many thanks!
Kind regards,
Yasmin