Different results with multiple training runs

Hello,

I see discussions about this for OpenNMT-py, but could not find anything for OpenNMT-tf. I built a baseline model, fine-tuned it, and ran evaluation on the test dataset. I realized I had only exported the CTranslate2 model and forgot to save the TF model, so I had to repeat the exact same experiment. That was when I noticed that the translation results, both on the dev dataset during validation and on the test dataset, are slightly different (e.g. by 0.4+ BLEU) from the previous experiment. I ran it a third time, and the results were different again. Note that everything else is the same: the validation was compared at the same steps, and the CTranslate2 model was exported from the same checkpoint and on the same hardware.

Please also note that I use Relative Position Representations in all the experiments. So I wonder: is there a missing seed value somewhere, or is this expected for some other reason?

The main reason I am asking is that I want to calculate the statistical significance of the results. However, if the results change by 0.4+ BLEU just from re-running the same experiment, I suppose a significance test would be meaningless whenever the improvement between experiments is small. The good news is that the improvement I have is much larger than this, but I would still like to understand this point, if possible.
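
For context, this is roughly how I intend to test significance, using paired bootstrap resampling over the two systems' outputs (a minimal sketch only; the file names are placeholders, and it assumes the sacrebleu Python package is available):

    # Paired bootstrap resampling (Koehn, 2004) over two system outputs.
    # File names below are placeholders for the reference and the two outputs.
    import random
    import sacrebleu

    refs = [line.strip() for line in open("test.ref", encoding="utf-8")]
    sys_a = [line.strip() for line in open("baseline.out", encoding="utf-8")]
    sys_b = [line.strip() for line in open("finetuned.out", encoding="utf-8")]

    n_resamples = 1000
    wins = 0
    indices = range(len(refs))
    for _ in range(n_resamples):
        # Sample sentence indices with replacement, identically for both systems.
        sample = [random.choice(indices) for _ in indices]
        sampled_refs = [[refs[i] for i in sample]]
        bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in sample], sampled_refs).score
        bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in sample], sampled_refs).score
        if bleu_b > bleu_a:
            wins += 1

    # One-sided p-value: fraction of resamples where the fine-tuned system does not win.
    print("p-value:", 1 - wins / n_resamples)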

I would highly appreciate any clarification on this matter. I will be happy to share more details if needed. Many thanks!

Kind regards,
Yasmin

Hi,

The random seed is not fixed by default, so getting different results is expected. However, the difference should get smaller as you train for longer and with more data. For how many steps did you train these models?

You can set the --seed option to get the same random initialization and batch ordering, but even in this mode the results are not fully reproducible because GPU computation is not deterministic.
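
For example, on the command line it would look something like this (the config file name is just a placeholder):

    onmt-main --config data.yml --auto_config --seed 1234 train --with_eval

Note that --seed is a general option of onmt-main, so it goes before the run type.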

Many thanks, Guillaume!

It is fine-tuning. In this case it was for only 5000 steps.

Kind regards,
Yasmin