I’d like some insight into two observations:
After training several models on corpora ranging from 3M to 60M sentences, I can safely conclude that there is a “standard” convergence limit of around 100k steps for base transformers and around 150k steps for big transformers, regardless of corpus size. Gains after that point are minimal and definitely not cost-effective in terms of time and power consumption. I expected that training on a larger corpus would take longer to converge, but that seems not to be the case. I suspect some parameter fine-tuning may be required, so I’d appreciate any insight on that. Currently I use the default config parameters, with a batch size of 1586 for big transformers.
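As a rough sanity check on why the step count might matter more than the corpus size, here is a small back-of-the-envelope sketch (my own, with assumed numbers: I take the batch size of 1586 to be in tokens, OpenNMT-style, and assume ~25 tokens per sentence on average) computing how many passes over each corpus 100k steps actually cover:

```python
BATCH_TOKENS = 1586   # batch size from the post (assumed to be measured in tokens)
AVG_SENT_LEN = 25     # assumed average tokens per sentence (illustrative only)
STEPS = 100_000       # the observed convergence point for base transformers

def epochs_seen(corpus_sentences, steps=STEPS):
    """Approximate number of full passes over the corpus after `steps` updates."""
    sentences_per_step = BATCH_TOKENS / AVG_SENT_LEN
    return steps * sentences_per_step / corpus_sentences

for corpus in (3_000_000, 60_000_000):
    print(f"{corpus:>11,} sentences: ~{epochs_seen(corpus):.2f} epochs")
```

Under these assumptions, 100k steps is about two epochs of a 3M-sentence corpus but only a fraction of one epoch at 60M sentences, which makes the identical convergence point genuinely surprising and worth questioning the learning-rate schedule and effective batch size.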
The second, rather surprising, observation is the huge impact of beam size during inference (with CTranslate2) for only some language pairs. For instance, changing the beam size for English->Greek translation has minimal impact, but changing it for Italian->Greek (and, to a lesser extent, for English->Italian) has a large impact. In the latter case, CTranslate2’s default beam size of 2 yields poor results, but increasing it to 10–12 drastically improves the translation. Also, can the beam size drastically affect the evaluation during training, since a relatively small beam size is used there? Is there any information or a published paper related to this topic?
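To make the mechanism concrete, here is a toy beam search (my own sketch, not CTranslate2’s implementation) over a hand-built next-token table. It shows how a small beam can permanently drop the prefix that leads to the best-scoring complete hypothesis, which is one plausible explanation for a large beam-2 vs. beam-10 gap on some language pairs:

```python
import math

# Toy conditional distribution: maps a prefix (tuple of tokens) to
# {next_token: probability}. "</s>" ends a hypothesis. All values invented.
MODEL = {
    (): {"A": 0.40, "B": 0.35, "C": 0.25},
    ("A",): {"</s>": 0.50, "A": 0.50},
    ("B",): {"</s>": 0.50, "B": 0.50},
    ("C",): {"</s>": 0.95, "C": 0.05},
    ("A", "A"): {"</s>": 1.0},
    ("B", "B"): {"</s>": 1.0},
    ("C", "C"): {"</s>": 1.0},
}

def beam_search(beam_size, max_len=3):
    """Return the best finished (prefix, log-prob) found with the given beam."""
    beams = [((), 0.0)]            # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, p in MODEL.get(prefix, {}).items():
                cand = (prefix + (tok,), score + math.log(p))
                (finished if tok == "</s>" else candidates).append(cand)
        # prune: keep only the `beam_size` best unfinished hypotheses
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if not beams:
            break
    return max(finished, key=lambda c: c[1])

print("beam 2:", beam_search(2))
print("beam 3:", beam_search(3))
```

With beam 2, the search commits to the two best first tokens ("A" and "B") and never recovers "C", even though "C </s>" has the highest total probability; beam 3 finds it. The same effect at scale could explain why a pair whose model produces flatter first-step distributions benefits so much more from a larger beam.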