Weighted dataset

SamuelLacombe · January 12, 2023, 2:02am

Hello,

I want to train a model with multiple languages. To do so, I have the Source and Target for each language in individual files. Some languages have files much, much bigger than others. Based on the description on the website:

This configuration will create a weighted dataset where examples will be randomly sampled from the data files according to the provided weights. The weights are normalized by the file size so that examples from small files are not repeated more often than examples from large files during the training.

Should I set all the weights equal, since there is normalization? Or should I give more weight to the smaller files to counterbalance the normalization?

I want to make sure the model doesn’t focus too much on the languages with more data.

Thank you,
Samuel

guillaumekln · January 12, 2023, 10:24am

Hi,

If you want the training to see the same number of lines from each file, you should give more weight to the smaller files.

For example, if corpus A is half the size of corpus B, you could give weight 2 to corpus A and weight 1 to corpus B.
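
To make the arithmetic concrete, here is a minimal sketch. It assumes that "normalized by the file size" means each file's share of sampled examples is proportional to weight × number of lines; that is a reading of the quoted description, not something confirmed in this thread. The function names and line counts are made up for illustration.

```python
def balancing_weights(sizes):
    """Return one weight per file so that weight * size is equal for all files."""
    largest = max(sizes)
    return [largest / size for size in sizes]


def sampling_shares(sizes, weights):
    """Share of sampled examples per file.

    ASSUMPTION: the share is proportional to weight * file size.
    """
    scaled = [w * s for w, s in zip(weights, sizes)]
    total = sum(scaled)
    return [x / total for x in scaled]


# Hypothetical line counts: corpus A is half the size of corpus B.
sizes = [1000, 2000]

weights = balancing_weights(sizes)
print(weights)                          # [2.0, 1.0]
print(sampling_shares(sizes, weights))  # [0.5, 0.5]

# With equal weights, the larger corpus keeps the larger share.
print(sampling_shares(sizes, [1, 1]))
```

Under that assumption, equal weights leave corpus B with about two thirds of the sampled lines, while weights 2 and 1 bring both corpora to one half each.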

liu663 · March 21, 2023, 1:27pm

How do I give different weights to different corpora in the configuration file? I didn’t find it in the tutorial. If you can help answer, I would be very grateful!

SamuelLacombe · March 22, 2023, 2:55am

Hello,

I have not had time to experiment with it yet, but if you search for “train_files_weights” on the forum, you will find a few examples!

Best regards,
Samuel
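
As a rough illustration of where the weights would go, the sketch below builds the data section of a configuration as a Python dictionary. Only `train_files_weights` is named in this thread; the `train_features_file` and `train_labels_file` keys are assumed from the OpenNMT-tf data configuration, and the file names and the `build_data_config` helper are hypothetical. Please check the documentation for your version before relying on it.

```python
import json


def build_data_config(corpora):
    """Build a data section from (source_path, target_path, weight) tuples.

    ASSUMPTION: the three lists are matched by position, so the i-th weight
    applies to the i-th source/target file pair.
    """
    return {
        "data": {
            "train_features_file": [src for src, _, _ in corpora],
            "train_labels_file": [tgt for _, tgt, _ in corpora],
            "train_files_weights": [weight for _, _, weight in corpora],
        }
    }


# Hypothetical corpora: the smaller one gets the larger weight.
config = build_data_config([
    ("small.src", "small.tgt", 2),
    ("large.src", "large.tgt", 1),
])

# JSON output is printed here only to show the structure;
# the same keys would be written as YAML lists in a config file.
print(json.dumps(config, indent=2))
```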