The idea is to completely remove the preprocessing stage, to enable working on very large datasets.
The new features are:
in preprocess.lua - you can now use the -train_dir PATH option to point to a directory with multiple training files (suffixes default to .src/.tgt for bitext but can be defined with -src_suffix SRCSUF/-tgt_suffix TGTSUF): there is no need to aggregate files (they still need to be tokenized - a further step might also be to integrate on-the-fly tokenization)
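As a sketch, a run over a directory of training files might look like the following (paths and the -save_data prefix are placeholders; -valid_src/-valid_tgt/-save_data are the usual preprocess.lua options):

th preprocess.lua -train_dir data/train -src_suffix .src -tgt_suffix .tgt \
  -valid_src data/valid.src -valid_tgt data/valid.tgt -save_data data/demo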
all the options of preprocess.lua can be used in train.lua - this has a triple goal:
** you don’t need to preprocess your data anymore
** this combines with the -sample X option, which allows you to work on very large datasets (I tried it on 60M sentences) and sample at each epoch a subset of the full dataset without any memory problem. X can either be a number of sentences or, if a float in [0,1], the ratio of the complete dataset to sample at each epoch - see the example command after this list
** you can dynamically influence the sampling by providing a weight for each corpus you have in your -train_dir directory - see the post below. In particular, you can over-sample some in-domain corpus while still keeping a good mix with out-of-domain data
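As a sketch (file names and the model path are illustrative; -save_model and -gpuid are the usual train.lua options), training directly on the raw files could look like:

th train.lua -train_dir data/train -src_suffix .src -tgt_suffix .tgt \
  -valid_src data/valid.src -valid_tgt data/valid.tgt \
  -sample 0.2 -save_model models/demo -gpuid 1

Here -sample 0.2 draws 20% of the full collection at each epoch, while -sample 2000000 would draw 2M sentences instead.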
I am still testing, but any feedback/test results/additional ideas are welcome!
if you want to have, say, 20% of IT (IT*+MSDN), 10% of colloquial, and 65% of generic - use the following rule file, which you specify with the -sample_dist FILE option:
IT,MSDN 20
colloquial 10
generic 65
* 5
each rule is a collection of Lua patterns separated by commas; * matches all files.
the numbers are normalized so that they sum to 1
here, 20% will come from IT1, IT2 and MSDN together. In combination with the -sample N parameter, this even enables over-sampling of under-represented categories.
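To make the normalization concrete: the weights above sum to 20 + 10 + 65 + 5 = 100, so they become 0.20, 0.10, 0.65 and 0.05. With -sample 1000000, roughly 200k sentences would come from the IT*/MSDN files, 100k from colloquial, 650k from generic and 50k from everything else. A possible invocation (file names are illustrative) would be:

th train.lua -train_dir data/train -sample 1000000 -sample_dist rules.txt \
  -save_model models/demo -gpuid 1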
the final code is there - I have introduced parallel processing of the data files in a train_dir, so the usage and benefits are the following:
normal process using preprocess.lua
– it will be far faster if you use train_dir than if you concatenate the different files manually
– you can use sampling and sample_dist to automatically build your training data with a specific distribution out of a collection
new process - no need to use preprocess.lua at all, which saves a lot of time, and with sampling you have far better control over the training data distribution
Interesting.
One other thing that could help is to artificially modify the distribution of sentence lengths.
It is obvious that very short (<10) and very long (>40) segments are under-represented.
Maybe modifying this Gaussian could help to better translate these instances.
On-the-fly tokenization is now also introduced. The training data does not need any type of preparation: all tokenization options are available for source and target with the tok_src_ and tok_tgt_ prefixes.
This is done using multiple threads, so it remains fast for large collections of documents.
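As a minimal sketch, assuming the tokenizer options simply take the tok_src_/tok_tgt_ prefixes (only the case_feature variants are explicitly mentioned in this thread), a run on raw text could look like:

th train.lua -train_dir data/train \
  -tok_src_case_feature true -tok_tgt_case_feature true \
  -sample 0.2 -save_model models/demo -gpuid 1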
@jean.senellart,
Is it dynamic in the sense that each dataset generates one .t7 file and it then samples/balances from each .t7 file dynamically, OR do you generate only one single .t7 file based on the sampling/balance you decide upfront?
IMO it is more convenient to have several .t7 files already generated and to balance at training time for each run.
@vince62s - there is no .t7 file generated at all - all the computation is done on the non-tokenized, unprocessed corpus - which is actually generally faster. The weighted sampling is done at each epoch.
before, we had a “huge” processed data file (tens of GB when dealing with a reasonable corpus) and it took many minutes to load this .t7 file before the beginning of the training, so how does it work now?
also, when using vocab importance sampling, i.e. many small epochs, does this add much overhead?
my benchmark (using 10 CPUs for preprocessing and individual files of about 2-3M segments) compared a 20M-sentence preprocessed .t7 (~30 min loading time) vs. dynamic sampling (~3 min processing time at each epoch). For an epoch lasting ~4-6 hours, this makes the overhead negligible and allows you to scale to more than 20M.
I can optimize a bit more (allocating multiple threads for a single file) if needed
the problem is that LM_MANAGEMENT.fr-en. has to be a Lua pattern, so the - should be escaped as %-, giving LM_MANAGEMENT.fr%-en. - but concretely you should simply use LM_MANAGEMENT.
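A quick way to check this in the th interpreter (the unescaped - acts as a quantifier in Lua patterns, so the first pattern does not match the file name):

print(string.find("LM_MANAGEMENT.fr-en.src", "LM_MANAGEMENT.fr-en."))   -- nil
print(string.find("LM_MANAGEMENT.fr-en.src", "LM_MANAGEMENT.fr%-en."))  -- 1  20
print(string.find("LM_MANAGEMENT.fr-en.src", "LM_MANAGEMENT"))          -- 1  13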
(last post deleted: I found the answer by myself)
I’m using -tok_src_case_feature true -tok_tgt_case_feature true with my own dicts, with all entries lowercased. In the saved validation translations, all words with one or more uppercase letters appear as UNK. I was expecting that, with the case feature activated, all words would be looked up lowercased in the dicts.
I saw this in the validation translation, but what about the training sentences?
Is there a place where I can see which sentences are used for an epoch?
Is it possible to also see how ONMT is handling their words with the dicts?