I guess it takes 23 examples from commoncrawl, 19 examples from europarl and 3 from news_commentary.
If the batch size is 128, is this sampling then repeated many times?
Yes. The batch building process is basically an infinite loop.
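The mechanism described above can be sketched as follows. This is a minimal illustration, not the toolkit's actual implementation: the function names (`weighted_example_stream`, `batches`) and the use of `itertools.cycle` are assumptions made for the sketch.

```python
import itertools

def weighted_example_stream(corpora, weights):
    """Infinite stream of examples: on each pass, yield `weight` examples
    per corpus (e.g. 23 from commoncrawl, 19 from europarl, 3 from
    news_commentary), restarting any corpus that runs out."""
    iterators = [itertools.cycle(c) for c in corpora]  # each corpus repeats forever
    while True:
        for it, w in zip(iterators, weights):
            for _ in range(w):
                yield next(it)

def batches(stream, batch_size):
    """Cut fixed-size batches from the stream; the batch size need not be
    a multiple of the total weight, since the stream just keeps going."""
    while True:
        yield [next(stream) for _ in range(batch_size)]
```

For example, with two corpora and weights 2 and 1, the stream yields two examples from the first corpus, one from the second, and repeats indefinitely.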
What if the batch size is not a multiple of the total weight?
Batch building and weighting are not strict. See the part about
How can I give equal weight to all corpora?
In the sense that every corpus will be seen equally frequently? Just set all weights to 1. It will then iteratively sample one example from each corpus.
What if some corpus is small? Will it be seen in training more than once before the biggest corpus has been completely seen?
Yes. If you want to replicate the 'concatenation' behavior, you can give approximate weights based on the sizes of your datasets. E.g. if dataset A is 10 times bigger than dataset B, then set weight A = 10 and weight B = 1.
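One way to derive such weights automatically is to normalize each corpus size by the smallest one. This is a hypothetical helper, not part of any toolkit; the name `approximate_weights` and the rounding scheme are assumptions.

```python
def approximate_weights(sizes):
    """Derive integer weights roughly proportional to corpus sizes, so that
    weighted sampling approximately mimics training on the concatenation.
    `sizes` maps corpus name -> number of examples."""
    smallest = min(sizes.values())
    # Round to the nearest integer, but never drop a corpus below weight 1.
    return {name: max(1, round(n / smallest)) for name, n in sizes.items()}

# If dataset A is 10 times bigger than dataset B:
weights = approximate_weights({"A": 1_000_000, "B": 100_000})
# -> {"A": 10, "B": 1}
```

Note that this only approximates concatenation: the small corpus will still be repeated once the sampler has cycled through it, as discussed above.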