Using weighted datasets

panosk · February 19, 2021, 11:40am

Hello,

I would like to use weighted datasets to clean up a bit my domain adaptation pipeline. Could you please share the normalization formula that gives the actual number of sentences from each dataset? And is it possible to oversample, as in the nmt-wizard (e.g *2)?
Also, when setting weighted datasets, I also have to explicitly set sample_buffer_size to 0, otherwise a ValueError is raised. Maybe this could be set automatically.

panos345 · February 19, 2021, 12:48pm

@panosk
Καλησπέρα.
Δεν έχει σχέση αυτό που θα πω με την ερώτηση σου, αλλά δεν ξέρω πως αλλιώς να επικοινωνήσω. Μήπως ξέρεις αν υπάρχει κάπου έτοιμο μοντέλο για μεταφράσεις κειμένων από Αγγλικά και Γαλλικά σε ελληνικά;
Ευχαριστώ.

guillaumekln · February 19, 2021, 12:58pm

Hi,

The final value is not a number of sentences but a probability to pick a sentence from a dataset.

Best is to look at the code and tests:

github.com

OpenNMT/OpenNMT-tf/blob/v2.15.0/opennmt/data/dataset.py#L70-L76


# Weights should be normalized by the dataset size relative to the total size.
total_size = sum(sizes)
weights = [weight * (size / total_size) for weight, size in zip(weights, sizes)]
# Convert weights to probabilities.
logits = tf.math.log(tf.constant(weights, dtype=tf.float32))
probabilities = tf.nn.softmax(logits).numpy().tolist()

In this test there are 2 datasets with 4 lines and 2 lines.

In the first case the datasets have the same weight and the probability is the same as sampling uniformly from all data.
In the second case, the second dataset with 2 lines has double the weight of the first dataset with 4 lines: the examples of the second dataset will be seen 2 times more often than the first dataset.

There is no direct equivalent to the syntax you are referring to. But as mentioned above, setting a weight that is the double of another weight is the same as making the examples appear 2 times more often during the training.

The default value of sample_buffer_size is -1 which means “the dataset size”. But in the case of weighted dataset, there is no total dataset size. You should set a constant value but large enough to correctly shuffle the examples. For example set sample_buffer_size to 5000000, but not 0 which disables shuffling. We could indeed do that automatically.

panosk · February 19, 2021, 1:09pm

Thank you!
Btw, according to the docs, the default sample_buffer_size is set to 50000.

guillaumekln · February 19, 2021, 1:13pm

Yes, that’s correct but not when using --auto_config which comes with another set of defaults which are better for the task and model.

panosk · February 19, 2021, 2:24pm

Hi @panos345 ,

Check in your forum private messages.