Training convergence and beam size impact

Hi all,

I’d like some insight about two observations:

After training several models on corpora of different sizes, ranging from 3M to 60M sentences, I can safely conclude that there is a “standard” convergence limit of around 100k steps for base transformers and around 150k steps for big transformers, no matter the corpus size… Gains after that point are minimal and definitely very cost-ineffective in terms of time and power consumption. I thought that training on larger corpora would take longer to converge, but it seems that is not the case. I suspect some parameter fine-tuning may be required, so I’d appreciate any insight on that. Currently I use the default config parameters, with a batch size of 1586 for big transformers.
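
To make “minimal gains” concrete, the stopping decision can be reduced to a simple plateau check over the validation scores. This is only an illustrative sketch (the scores, threshold, and patience values are made up, not my actual logs):

```python
# Illustrative plateau check: consider training converged when validation BLEU
# has not improved by more than `min_gain` over the last `patience` evaluations.
def has_converged(bleu_history, patience=4, min_gain=0.2):
    """bleu_history: validation BLEU scores, oldest first."""
    if len(bleu_history) <= patience:
        return False
    best_before = max(bleu_history[:-patience])
    best_recent = max(bleu_history[-patience:])
    return best_recent - best_before < min_gain

# Example with made-up scores that flatten out, as I see around 100k steps.
scores = [18.1, 24.5, 27.9, 29.3, 30.1, 30.4, 30.5, 30.5, 30.5, 30.5]
print(has_converged(scores))  # True: the last evaluations add almost nothing
```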

The second, rather surprising, observation is the huge impact of beam size during inference (with CTranslate2) for only some language pairs. For instance, changing the beam size for English->Greek translation has minimal impact, but changing it for Italian->Greek (and to a lesser extent for English->Italian) has a large impact. In the latter case, CTranslate2’s default beam size of 2 yields poor results, but increasing it to 10-12 drastically improves the translation. Also, can the beam size drastically affect the evaluation during training, since a relatively small beam size is used there? Is there any information or a published paper related to this topic?

Thanks!


Hi Panos,
For base transformers I have found exactly that: little apparent benefit from going above 100k steps.

Hey Panos!
With which metric do you define convergence?
Some transformer studies show that a bigger corpus takes longer to converge. The RoBERTa paper even states that training longer yields better results. On the other hand, you get practical results with early stopping and iterative back-translation. A bigger batch size can also result in faster convergence and overall better quality. Have a look at Kaggle, or halve and quantize the precision if your GPU memory is limited.
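
To make the precision suggestion concrete, here is a generic mixed-precision training sketch in plain PyTorch (not OpenNMT code, and it assumes a CUDA GPU); OpenNMT has its own options for mixed precision, so check its documentation for the actual switches:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Generic mixed-precision sketch: run the forward pass in float16 where safe,
# and use a GradScaler so the half-precision gradients do not underflow.
model = torch.nn.Linear(512, 512).cuda()      # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()
scaler = GradScaler()

for _ in range(100):                          # stand-in for the data loader
    x = torch.randn(64, 512, device="cuda")
    y = torch.randn(64, 512, device="cuda")
    optimizer.zero_grad()
    with autocast():                          # forward pass in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()             # scale the loss, then backprop
    scaler.step(optimizer)                    # unscale gradients and step
    scaler.update()
```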

Also, can the beam size drastically affect the evaluation during training, since a relatively small beam size is used there?

Most of the time a bigger beam size achieves better results, but it also increases the search space and computation time. A good start is to tune it for your specific model and requirements; there are also different beam search strategies to consider.
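
If it helps, a rough sweep with the CTranslate2 Python API could look like the sketch below. The file paths, the SentencePiece-style detokenization and the sacrebleu scoring are assumptions about your pipeline, and recent CTranslate2 versions return result objects with a `.hypotheses` attribute (older ones return dicts):

```python
import ctranslate2
import sacrebleu

# Rough beam-size sweep on a held-out set that is already tokenized the same
# way as the training data (e.g. with SentencePiece). Paths are placeholders.
translator = ctranslate2.Translator("model_ct2", device="cuda")

with open("dev.src.tok") as f:
    sources = [line.split() for line in f]
with open("dev.ref.detok") as f:
    references = [line.strip() for line in f]

for beam in (1, 2, 4, 8, 12):
    results = translator.translate_batch(sources, beam_size=beam)
    # Join subword tokens back to plain text; swap in your own detokenizer.
    hyps = ["".join(r.hypotheses[0]).replace("▁", " ").strip() for r in results]
    bleu = sacrebleu.corpus_bleu(hyps, [references])
    print(f"beam_size={beam:2d}  BLEU={bleu.score:.2f}")
```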

Greetings from the translation space


Hi @Bachstelze,

Thanks for your input. Actually I’ve been using the “Training Tips for the Transformer Model” study, which I find very useful. I use BLEU as the metric. The batch size is the maximum my GPUs (2x RTX 2080 Ti) can handle; above that I get out-of-memory errors. As the study mentions, a larger corpus should take longer to converge, but it seems this isn’t noticeable with the default parameters, and that’s my question.

Concerning the beam size, it does indeed seem that it needs to be fine-tuned per model/language pair. It would be really useful, though, if an optimization strategy could be implemented during training (if I recall correctly, it’s the beam_width parameter in OpenNMT-tf). For example, after a few scorings there could be a special step that tunes the beam width to an optimal value, and the subsequent evaluations would continue with this value. Just an idea :slight_smile:
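
Just to sketch the idea (purely hypothetical, not an existing OpenNMT-tf feature; `pick_beam_width` and `run_eval` are made-up names):

```python
# Hypothetical sketch: after the first few evaluations, score the dev set with
# a few candidate beam widths and keep the best one for all later evaluations.
def pick_beam_width(evaluate, candidates=(2, 4, 8, 12)):
    """evaluate(beam_width) -> dev BLEU; returns the best-scoring width."""
    scores = {b: evaluate(b) for b in candidates}
    return max(scores, key=scores.get)

# Pseudo-usage inside the training loop:
#   if step == tuning_step:
#       beam_width = pick_beam_width(lambda b: run_eval(model, dev_set, beam_width=b))
#   # ...all subsequent evaluations reuse this beam_width
```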

Have a look at the Scaling Laws for Neural Language Models and the scaling law from the dimension of the data manifold. Some models aren’t even trained for a whole epoch.
So a big corpus offers many possibilities with good performance. You could filter it into specific domains and balance a multi-domain model.
How many epochs do you train till convergence? Do you use back-translation and mixed precision?

For example, after a few scorings there could be a special step that tunes the beam width to an optimal value, and the subsequent evaluations would continue with this value

Yes, this idea would help prevent misleading evaluation results.

Argos Translate uses a minimum beam size of 4.