Can OpenNMT-py support multiple datasets for multilingual translation?

I want to train a multilingual translation model on several datasets, and I would like to specify a sampling probability distribution so that examples from the different languages are drawn at different rates.
Sampling beforehand and merging all the data into one corpus is one way to handle this, but every time I add new data for one of the languages I have to reprocess everything again to keep the data balanced. Can OpenNMT do the sampling during the training stage now?
Any response is appreciated! Thanks.

Hello!

In some NMT toolkits like OpenNMT, the equivalent of over-sampling is “weights”. Here is an example:
https://opennmt.net/OpenNMT-py/FAQ.html#how-can-i-weight-different-corpora-at-training

For example, if your data ratio is 10:1, you use weights of 1:10.
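To make that concrete, the weights go into the data section of the training config. A minimal sketch, assuming two corpora where corpus_small is about ten times smaller than corpus_big (corpus names and paths are placeholders):

```yaml
data:
    corpus_big:
        path_src: data/big.src      # the larger corpus (~10x more sentences)
        path_tgt: data/big.tgt
        weight: 1
    corpus_small:
        path_src: data/small.src    # the smaller corpus
        path_tgt: data/small.tgt
        weight: 10
    valid:
        path_src: data/valid.src
        path_tgt: data/valid.tgt
```

During training, examples are then drawn in a 1:10 alternation between the two corpora, which roughly evens out the 10:1 size difference.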


Regarding multilinguality, I see your point about balancing data. If I have 70M sentences for Spanish and 14M sentences for Portuguese, I might want to balance the data. Still, if the difference is not huge, maybe I would not bother.

In my Indic-to-English multilingual MT model, I simply added (transliterated) datasets for 10 Indic languages individually to the configuration, but without weighting. I see that for some languages the model is more accurate than for others.

I tried weights to apply over-sampling in mixed fine-tuning for domain adaptation, so I know it works well. I might try it next in my multilingual MT model and see how it affects the performance. Obviously, the next step would be data augmentation with Back Translation as well.

Kind regards,
Yasmin


Thank you so much for your answer!
I just went through the examples about “weights”. I am wondering: if my sampling ratio is 82:40:9 for 3 corpora, and my batch size is 4096 tokens, how will the batches be sampled? The batch size is not set in sentences.

Hello!

My understanding is that weights work at the sentence level anyhow; it is more like a ratio. Still, either @francoishernandez or @guillaumekln can confirm or correct this.

Kind regards,
Yasmin

Weights are applied at the dataset level, cf. the docs.

Each entry of the data configuration will have its own weight. When building batches, we’ll sequentially take weight examples from each corpus.

As for the question of sampling batches, the idea is that we don’t build batches one at a time, but with a pooling mechanism:

Note: don’t worry about batch homogeneity/heterogeneity, the pooling mechanism is here for that reason. Instead of building batches one at a time, we will load pool_factor of batches worth of examples, sort them by length, build batches and then yield them in a random order.
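To illustrate the idea only (this is a simplified sketch, not the actual OpenNMT-py code, and it counts examples rather than tokens), the combination of weights and pooling behaves roughly like this:

```python
import random
from itertools import islice

def weighted_example_stream(corpora, weights):
    # Take `weight` examples from each corpus in turn, then start over.
    # Assumes each corpus is an endless (cycled) iterator of examples.
    while True:
        for corpus, weight in zip(corpora, weights):
            for _ in range(weight):
                yield next(corpus)

def pooled_batches(examples, batch_size, pool_factor):
    # Load pool_factor batches' worth of examples, sort them by length,
    # cut them into batches, then yield the batches in random order.
    while True:
        pool = list(islice(examples, batch_size * pool_factor))
        if not pool:
            return
        pool.sort(key=len)
        batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
        random.shuffle(batches)
        yield from batches
```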

We experimented with sentence-level weighting at some point, but it was never merged.
It could probably be adapted quite easily to work with 2.0 though.


Hi @francoishernandez, hope you are still around. I am trying to use the “weight” option for multiple datasets. Before that, I did the sampling manually: for example, I sampled three datasets at 8.4x, 2.1x, and 1.0x respectively, and then calculated the ratio of each dataset after sampling, roughly 0.21, 0.35, and 0.44.
So if I understand correctly, I should set the “weight” values to 21, 35, and 44 for the three unsampled datasets. But this method seems to perform worse, around -3.0 BLEU compared to the manual sampling. Is there anything wrong with my weight settings?
Any response is appreciated. Thanks!

@SefaZeng Until François is back, I hope I got your question right.

The purpose of dataset over-sampling is to end up with datasets that have a virtually equal number of sentences. In OpenNMT-py, weights depend on sentence counts, not ratios, so the weight should be the inverse of your data size.


Example

Data Size:
Dataset #1: 100,000 sentences
Dataset #2: 10,000 sentences

Data Weights:
Dataset #1: 1
Dataset #2: 10

So your training will take one sentence from Dataset #1 and ten sentences from Dataset #2.

I hope this is clearer now.

Kind regards,
Yasmin

Thank you @ymoslem. But I don’t want all the corpora to have an equal number of sentences. I want to upsample the low-resource corpora and downsample the high-resource corpora, but not make them all equal to each other. So setting the weight to the inverse of the data size may not be what I want.

This was an example to give you an idea of how weights work in OpenNMT-py.

In your example, you have 3 datasets, and you applied manual sampling: 8.4x, 2.1x and 1.0x. This means the first dataset is the smallest and you upsampled it, right?

If so, then your initial weights in OpenNMT-py were not correct. They were 21, 35 and 44, which means the training would take fewer samples from the smallest dataset.

If I understand your example correctly, to replace the manual sampling with weights in OpenNMT-py, the weights here should be 8, 2 and 1.
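In the config, that would look something like the following sketch (corpus names and paths are placeholders):

```yaml
data:
    smallest_corpus:
        path_src: data/corpus1.src
        path_tgt: data/corpus1.tgt
        weight: 8
    medium_corpus:
        path_src: data/corpus2.src
        path_tgt: data/corpus2.tgt
        weight: 2
    largest_corpus:
        path_src: data/corpus3.src
        path_tgt: data/corpus3.tgt
        weight: 1
```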

Apologies if this does not solve your issue; if not, you might want to give more details about the original sizes of your datasets and what you want to achieve.

All the best,
Yasmin
