I want to train a model on multiple languages. To do so, I have separate source and target files for each language pair. Some languages have files much, much bigger than others. Based on the description on the website:
This configuration will create a weighted dataset where examples will be randomly sampled from the data files according to the provided weights. The weights are normalized by the file size so that examples from small files are not repeated more often than examples from large files during the training.
Should I set all the weights equal, since the normalization is already applied? Or should I give more weight to the smaller files to counterbalance the normalization?
I want to make sure the model doesn’t focus too much on the languages with more data.
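To make my question concrete, here is a small sketch of how I currently understand the normalization. Everything here is my assumption, not taken from the OpenNMT docs: I'm guessing that each weight is effectively scaled by its file's size before normalization, so that every individual example is equally likely regardless of which file it sits in.

```python
# Hypothetical illustration of my reading of "weights are normalized by
# the file size". The formula is my assumption, not the documented one.

def file_sampling_probs(weights, sizes):
    """Per-file selection probability if each weight is scaled by its
    file size before normalization (so per-example probability is
    uniform when weights are equal)."""
    scaled = [w * n for w, n in zip(weights, sizes)]
    total = sum(scaled)
    return [s / total for s in scaled]

sizes = [1_000_000, 10_000]   # big-language file vs. small-language file

# With equal weights, the big file would dominate at the corpus level:
print(file_sampling_probs([1, 1], sizes))    # roughly [0.99, 0.01]

# Up-weighting the small file would counterbalance that:
print(file_sampling_probs([1, 100], sizes))  # [0.5, 0.5]
```

If that reading is right, equal weights would still let the large languages dominate, which is exactly what I'm trying to avoid.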