Increasing effective batch size

Hello! I’d be grateful if someone could give answers or some explanation to the following two questions:

  • How does increasing the effective batch size affect the final result of a trained model: does it get worse, get better, or stay the same? I am currently using an effective batch size of 25 000.
  • If I increase the effective batch size to 80 000 and train the model for 65 000 steps, will that be approximately equivalent to training with an effective batch size of 25 000 for 200 000 steps?
    Many thanks in advance!


  • For Transformer training, increasing the effective batch size can improve the final results and/or speed up convergence. See for example which uses an effective batch size of 400k (even > 600k for the ENFR training).
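  • No, these are two different training runs. They will see approximately the same amount of data, but not under the same learning rate regime, for example: the learning rate schedule usually depends on the step number, so a run with fewer steps ends at a different point of the schedule.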
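
To make the second answer concrete, here is a rough sketch of the arithmetic. It assumes the inverse square root schedule from the original Transformer paper; the thread does not say which schedule is actually used, and `d_model=512` and `warmup_steps=8000` are illustrative values, not settings taken from this thread.

```python
def total_seen(effective_batch_size: int, train_steps: int) -> int:
    """Amount of data seen over a run, in whatever unit the batch size uses."""
    return effective_batch_size * train_steps


def inverse_sqrt_lr(step: int, d_model: int = 512, warmup_steps: int = 8000) -> float:
    """Learning rate at a given step (assumed schedule, illustrative defaults)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The two runs from the question.
run_a = total_seen(25_000, 200_000)  # 5_000_000_000
run_b = total_seen(80_000, 65_000)   # 5_200_000_000

# Same order of data, but the shorter run stops earlier in the decay,
# so it finishes with a higher learning rate.
final_lr_a = inverse_sqrt_lr(200_000)
final_lr_b = inverse_sqrt_lr(65_000)

print(f"data seen: {run_a:,} vs {run_b:,}")
print(f"final learning rate: {final_lr_a:.2e} vs {final_lr_b:.2e}")
```

Under these assumptions the two runs see nearly the same amount of data, but the run with 65 000 steps ends with a learning rate about 1.75 times higher, and each of its updates averages over more data. That is why the two runs are not interchangeable.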

Thanks for your reply!


To add to the question… what are the negative impacts of increasing the effective batch size?

Best regards,

I can’t think of any negative impacts.

However, for very large effective batch sizes a single step will take more time. So you might need to tune other parameters that are expressed in steps, such as the logging frequency and the checkpoint saving frequency.
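
One possible way to retune such step-based settings is to keep the amount of data between two events roughly constant. This is only a sketch: `rescale_interval` and the setting names `checkpoint_every` and `log_every` are hypothetical and do not come from any particular toolkit.

```python
def rescale_interval(interval_steps: int, old_batch_size: int, new_batch_size: int) -> int:
    """Scale a step interval so it covers about the same amount of data as before."""
    scaled = round(interval_steps * old_batch_size / new_batch_size)
    return max(1, scaled)  # never go below one step


# Hypothetical step-based settings tuned for an effective batch size of 25 000.
old_settings = {"checkpoint_every": 8000, "log_every": 160}

# Equivalent settings for an effective batch size of 80 000.
new_settings = {
    name: rescale_interval(steps, 25_000, 80_000)
    for name, steps in old_settings.items()
}
print(new_settings)
```

With these example values, a checkpoint every 8000 steps becomes one every 2500 steps, and logging every 160 steps becomes every 50 steps.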