I would like to use weighted datasets to clean up my domain adaptation pipeline a bit. Could you please share the normalization formula that gives the actual number of sentences taken from each dataset? And is it possible to oversample, as in nmt-wizard (e.g. *2)?
Also, when using weighted datasets, I have to explicitly set sample_buffer_size to 0, otherwise a ValueError is raised. Maybe this could be set automatically.
@panosk
Good evening.
What I'm about to say is unrelated to your question, but I don't know how else to get in touch. Do you happen to know if there is a ready-made model available somewhere for translating texts from English and French into Greek?
Thank you.
The final value is not a number of sentences but the probability of picking a sentence from a dataset.
The best reference is the code and the tests:
In this test there are 2 datasets: one with 4 lines and one with 2 lines.
In the first case, both datasets have the same weight and the probability is the same as sampling uniformly from all the data.
In the second case, the second dataset (2 lines) has double the weight of the first dataset (4 lines): the examples of the second dataset will be seen 2 times more often than those of the first.
There is no direct equivalent to the syntax you are referring to. But as mentioned above, setting a weight that is double another weight makes those examples appear 2 times more often during training.
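To make the normalization concrete, here is a minimal Python sketch of how the per-dataset picking probabilities could be derived from the weights and dataset sizes. This is only an illustration of the behaviour described above, not the actual OpenNMT-tf implementation:

```python
def dataset_probabilities(sizes, weights):
    """Return the probability of picking the next sentence from each dataset.

    The chance of drawing from a dataset is proportional to weight * size,
    so the per-sentence exposure is proportional to the weight alone.
    """
    scaled = [w * s for w, s in zip(weights, sizes)]
    total = sum(scaled)
    return [x / total for x in scaled]

# Two datasets with 4 and 2 lines, equal weights:
# probabilities are 2/3 and 1/3, i.e. each of the 6 lines has a 1/6 chance,
# which is the same as sampling uniformly from all the data.
print(dataset_probabilities([4, 2], [0.5, 0.5]))

# Same datasets, second weight doubled:
# per-line probability is 1/8 for the first dataset and 1/4 for the second,
# so each line of the second dataset is seen 2 times more often.
print(dataset_probabilities([4, 2], [0.5, 1.0]))
```

The second call also shows the equivalence with the *2 oversampling you mentioned: doubling the weight doubles the per-line exposure.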
The default value of sample_buffer_size is -1, which means "the dataset size". But in the case of weighted datasets, there is no total dataset size, so you should set a constant value that is large enough to correctly shuffle the examples. For example, set sample_buffer_size to 5000000, but not to 0, which disables shuffling. We could indeed set this automatically.
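To illustrate why the buffer size matters, here is a small sketch with plain tf.data (not the OpenNMT-tf pipeline itself). The buffer has to be large relative to the data for the shuffling to be effective; a tiny buffer only shuffles locally:

```python
import tensorflow as tf

data = tf.data.Dataset.range(10)

# A buffer as large as the dataset gives a proper uniform shuffle.
well_shuffled = data.shuffle(buffer_size=10)

# A tiny buffer only shuffles nearby elements: each example stays close to
# its original position, which is a problem when datasets are concatenated.
poorly_shuffled = data.shuffle(buffer_size=2)

print(list(well_shuffled.as_numpy_iterator()))
print(list(poorly_shuffled.as_numpy_iterator()))
```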