Odd validation curve as training batch size changes in OpenNMT-tf 2.18.1

I am seeing some strange behavior during training. I train on many small (~50k sentence) files, running a validation evaluation after a single pass through each file. I use a Transformer with relative position encoding. My source inputter is a SequenceRecordInputter and my target inputter is a WordEmbedder. The systems train for days and slowly appear to converge.
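For context, the model definition looks roughly like the sketch below (the input depth and the Transformer hyperparameters here are illustrative placeholders rather than my exact values):

```python
# model.py -- sketch of a relative-position Transformer with a
# SequenceRecordInputter source and a WordEmbedder target.
# input_depth and the hyperparameters below are placeholders.
import opennmt


class MyRelativeTransformer(opennmt.models.Transformer):
    def __init__(self):
        super().__init__(
            source_inputter=opennmt.inputters.SequenceRecordInputter(input_depth=40),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
            num_layers=6,
            num_units=512,
            num_heads=8,
            ffn_inner_dim=2048,
            dropout=0.1,
            attention_dropout=0.1,
            ffn_dropout=0.1,
            position_encoder_class=None,   # no absolute position encoding
            maximum_relative_position=20,  # relative position representations
        )


model = MyRelativeTransformer
```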

I occasionally had out-of-memory problems (though not in the three cases I’ll show below). I changed the training batch size from the default (about 3000) to 111, and the training runs picked up the batch_size reduction from the configuration yml as the bash script cycled through the training files. Unexpectedly, the validation score went up 1-3 BLEU points and the loss dropped dramatically.
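For illustration, the relevant part of the run configuration looks roughly like this (the batch_type value and the eval settings shown are illustrative rather than copied from my actual file):

```yaml
# config.yml (excerpt) -- illustrative values, not my exact file
train:
  batch_size: 111      # reduced from the ~3000-token default chosen by --auto_config
  batch_type: tokens   # assumed here; could also be "examples"

eval:
  external_evaluators: BLEU   # report BLEU at each validation evaluation
```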

I returned to the large batch_size, and the validation scores got worse. I repeated the cycle and got another valley in loss (and peak in BLEU), and this happened for each of the nine models I was training!

The figure below shows validation loss curves for three models. In each curve you can see three drops in loss caused by the three decreases in batch size.

[Image: validation loss curves for the three models]

I don’t understand this behavior. With any nonconvex optimization there is a chance of local optima and of batch-size dependence, but I would expect that mostly to affect the slope of the validation curve. In particular, the dramatic, repeated drop in quality when I increased the batch_size again is a mystery to me.

Has anyone else seen this behavior, or have an explanation for it? Could it be related to a bug in OpenNMT-tf 2.18.1 (in which case “Support” might be a better category than “Research”)?

Thanks in advance for any ideas,
Gerd

Can you post your training and model configurations?