Weighted dataset


I want to train a model with multiple languages. To do so, I have the Source and Target for each language in individual files. Some language files are much bigger than others. Based on the description on the website:

This configuration will create a weighted dataset where examples will be randomly sampled from the data files according to the provided weights. The weights are normalized by the file size so that examples from small files are not repeated more often than examples from large files during the training.

Should I set the weights all even since there is normalization? Or give more weight to the smaller files to counterbalance the normalization?

I want to make sure the model doesn’t focus too much on the languages with more data.

Thank you,


If you want the training to see the same number of lines from each file, you should give more weight to the smaller files.

For example, if corpus A is 2 times smaller than corpus B, you could give weight 2 to corpus A and weight 1 to corpus B.
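To make the arithmetic concrete, here is a small sketch of how the weights and file sizes combine, assuming (as the documentation describes) that the effective sampling probability of each file is proportional to its weight multiplied by its size. The corpus names and line counts below are made up for illustration.

```python
# Hypothetical line counts: corpus_A is 2 times smaller than corpus_B.
corpus_sizes = {"corpus_A": 500_000, "corpus_B": 1_000_000}

# User-provided weights: larger weight on the smaller corpus.
weights = {"corpus_A": 2, "corpus_B": 1}

# Assumed normalization: effective mass of a corpus is weight * size,
# so the sampling probability is (weight * size) / sum over all corpora.
masses = {name: weights[name] * corpus_sizes[name] for name in corpus_sizes}
total = sum(masses.values())
probs = {name: mass / total for name, mass in masses.items()}

print(probs)
# With these weights, both corpora are sampled with probability 0.5,
# i.e. the training sees the same number of lines from each file.
```

With equal weights instead, corpus_B would be sampled twice as often as corpus_A, which is exactly the imbalance the weights are meant to counteract.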
