Issues reproducing older model

Hi! I’m trying to reproduce a model trained with v0.9.2 (~2019) using a new one trained on v2.3.0. There are significant differences between the models, and the newer one performs worse (mostly on pass/fail human evaluation, but also on ROUGE and other metrics). The most likely culprit is some error in my preprocessing, which I’m still working on, but I’ve been at it for a few weeks now without much progress (and lots of passing unit tests :expressionless:), so I’m trying to eliminate a few more potential explanations. I would appreciate any input on the following. I’ve been using mostly the same parameters, minus what’s no longer supported. The notable differences between the two models are these:

  1. Training on 4 GPUs instead of 1
  2. Training with different batch sizes (due to seemingly random CUDA memory errors)
  3. Training with a much newer version of OpenNMT
  4. Training with interruptions, which means training often needs to continue from a checkpoint

I’m not sure how any of these affects the results, and I’m not sure whether isolating them is feasible cost-wise. I didn’t look further into the first two because I don’t think they could produce as big a difference as I’m seeing.
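That said, one thing that does change between the runs is the effective batch size per optimizer step. A back-of-the-envelope check (assuming `batch_size` is counted per GPU and gradients are synchronized across GPUs each step, as in standard data parallelism; adjust if your config counts batches globally or uses gradient accumulation):

```python
# Rough arithmetic, not OpenNMT code: effective examples per optimizer step
# under the assumption that batch_size is per-GPU and gradients are synced
# across all GPUs (standard data parallelism).
def effective_batch(gpus, batch_size, accum_count=1):
    return gpus * batch_size * accum_count

old_run = effective_batch(gpus=1, batch_size=16)
new_run = effective_batch(gpus=4, batch_size=16)
print(old_run, new_run)  # 16 64

# Epochs the new run would actually see (~800k samples, ~1.2M steps):
epochs = 1_200_000 * new_run / 800_000
print(round(epochs))  # 96
```

A 4x larger effective batch changes the gradient noise scale, so the same learning-rate schedule can behave differently between the two runs.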

For #3, the first thing that came to mind is the 2.0.0 message in the changelog, but I would guess that change made the model better, not worse.

For #4, my fear is that training may not cover all of the data uniformly (samples=~800k, steps=~1.2M, batch size=16, gpus=4). Interruptions usually happen around every 200k steps, but can happen much more often. I looked a bit into older posts, but there seems to be some contradictory information. As mentioned here (2019):

> Examples are randomly sampled from the full data. So the first 18,000 training steps saw examples from around the corpus, and not just the 18,000*32 first examples.

But here (2021) it seems that there is a different answer:

> […] It’s not ideal though since iteration on the datasets will start from scratch […]
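To make the worry concrete, here is a toy simulation (not OpenNMT code) of the worst case, assuming each restart re-reads the dataset from the beginning in the same order (e.g. a fixed shuffle seed); the real OpenNMT-py iterator may well behave differently:

```python
# Toy model of interrupted training: every interruption resets the data
# iterator to the start of the corpus, in the same order each time.
def coverage(n_examples, total_steps, batch_size, restart_every):
    """Count how often each example is visited under periodic restarts."""
    counts = [0] * n_examples
    pos = 0
    for step in range(1, total_steps + 1):
        for _ in range(batch_size):
            counts[pos % n_examples] += 1
            pos += 1
        if step % restart_every == 0:
            pos = 0  # interruption: fresh run starts iterating from scratch
    return counts

# Scaled-down numbers: 8k examples, restarts every 300 steps, so each
# segment (300 * 16 = 4800 examples) is shorter than one epoch.
counts = coverage(n_examples=8000, total_steps=1500, batch_size=16,
                  restart_every=300)
print(min(counts), max(counts))  # 0 5 -- the tail of the corpus is never seen
```

Of course, if each segment covers many epochs (or sampling is truly random, as the 2019 quote suggests), the skew disappears; that’s exactly what I’d like to confirm.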

Thanks!

Couldn’t add a third link. Here’s the changelog message I was referring to: OpenNMT-py/CHANGELOG.md at master · OpenNMT/OpenNMT-py · GitHub