Dynamic Dataset

Hi all,

I just added a new feature to OpenNMT:

The idea is to completely remove the preprocessing stage, to enable working on very large datasets.

The new features are:

  1. in preprocess.lua - you can now use the -train_dir PATH option to point to a directory containing multiple training files (suffixes default to .src/.tgt for bitext but can be changed with -src_suffix SRCSUF/-tgt_suffix TGTSUF): there is no need to aggregate files (they still need to be tokenized - a further step might be to integrate on-the-fly tokenization)
  2. all the options of preprocess.lua can now be used in train.lua - this serves a triple goal (see the example command after this list):
    ** you don’t need to preprocess your data anymore
    ** it combines with the -sample X option, which allows you to work on very large datasets (I tried it on 60M sentences) and sample a subset of the full dataset at each epoch without any memory problem. X can either be a number of sentences or, if a float in [0,1], the ratio of the complete dataset to sample at each epoch
    ** you can dynamically influence the sampling by providing a weight for each corpus in your -train_dir directory - see the post below. In particular, you can over-sample some in-domain corpora while still keeping a good mix with out-of-domain data
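To make this concrete, a preprocessing-free run could look roughly like the following (a minimal sketch using only options mentioned in this post: the data directory and model path are hypothetical, the files in data/corpora are assumed to use the default .src/.tgt suffixes, and the usual validation/vocabulary options are omitted):

th train.lua -train_dir data/corpora\
             -sample 1000000\
             -save_model models/dynamic_demo -gpuid 1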

I am still testing, but any feedback/test results/additional ideas are welcome!


To introduce weights for the different classes of the sampled data, assume the files in your train_dir directory are the following:

generic.src, generic.tgt
IT1.src, IT1.tgt
IT2.src, IT2.tgt
MSDN.src, MSDN.tgt
colloquial.src, colloquial.tgt
news.src, news.tgt

if you want to have, say, 20% of IT (IT*+MSDN), 10% of colloquial, and 65% of generic - use the following rule file, which you specify with the option -sample_dist FILE:

IT,MSDN 20
colloquial 10
generic 65
* 5
  • each rule is a comma-separated collection of Lua patterns; * matches all files.
  • the weights are normalized so that they sum to 1
  • here, 20% will come from IT1, IT2 and MSDN together. In combination with the -sample N parameter, this even enables over-sampling of under-represented categories (see the worked example below).
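As a worked example with hypothetical numbers: the weights above (20+10+65+5) sum to 100, so they normalize to 0.20/0.10/0.65/0.05; combined with -sample 1000000, each epoch would then draw roughly 200,000 sentences from IT1+IT2+MSDN together, 100,000 from colloquial, 650,000 from generic, and 50,000 from the remaining unmatched files (here news).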

The final code is there - I have introduced parallel processing of the data files in a train_dir, so the usage and benefits are the following:

  • normal process using preprocess.lua (see the sketch after this list)
    – it will be far faster if you use train_dir than if you concatenate the different files manually
    – you can use sampling and sample_dist to automatically build your training data with a specific distribution out of a collection
  • new process - no need to use preprocess.lua at all, which saves a lot of time, and with sampling you get far better control over the training data distribution
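For the first workflow, the preprocess.lua call could look roughly like this (again a minimal sketch: the paths are hypothetical, rules.txt stands for a -sample_dist rule file as described above, -save_data is the usual preprocess.lua output prefix, and validation/vocabulary options are omitted):

th preprocess.lua -train_dir data/corpora\
                  -sample 1000000 -sample_dist data/rules.txt\
                  -save_data data/demo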

This is a really great step forward, @jean.senellart . I can’t wait to get started!

Interesting.
One other thing that could help is to artificially modify the distribution of sentence lengths.

It is obvious that very short (<10 tokens) and very long (>40 tokens) segments are under-represented.
Maybe modifying this Gaussian could help to better translate these instances.

@vince62s - yes it is a good idea. I will add that too.

On-the-fly tokenization is now also introduced. The training data does not need any preparation at all: all tokenization options are available for source and target with the tok_src_ and tok_tgt_ prefixes (see the example below).
This is done using multiple threads, so it will perform well on large collections of documents.
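For example, to enable the case feature on both sides directly at training time (a minimal sketch with hypothetical paths, using only options that appear in this thread; other tokenizer options follow the same tok_src_/tok_tgt_ prefix scheme):

th train.lua -train_dir data/corpora\
             -tok_src_case_feature true -tok_tgt_case_feature true\
             -sample 1000000 -save_model models/dynamic_demo -gpuid 1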


@jean.senellart,
Is it dynamic in the way that each dataset will generate one .t7 file and it will then sample / balance from each .t7 file dynamically, OR do you generate only one single .t7 file based on the sampling / balance you decide upfront?

IMO it is more convenient to have several .t7 files already generated and to balance at training time for each run.

many thanks! this is a great feature!

@vince62s - there is no .t7 file generated at all - all the computation is done on the non-tokenized, unprocessed corpus, which is actually generally faster. The weighted sampling is done at each epoch.

hmm okay, just to be clear:

before, we had a “huge” processed data file (tens of GB when dealing with a reasonably sized corpus) and it took many minutes to load this .t7 file before the beginning of the training, so how does it work now?

also, when using vocab importance sampling, i.e. many small epochs, does this add much overhead?

my benchmark (using 10 CPUs for preprocessing and individual files of about 2-3M segments) compared a 20M-sentence preprocessed .t7 (~30 min loading time) vs. dynamic sampling (~3 min processing time at each epoch). For an epoch lasting ~4-6 hours, this makes the overhead negligible and allows you to scale beyond 20M sentences.
I can optimize a bit more (allocating multiple threads to a single file) if needed.


I’m trying to experiment with this.

I’m using:

 -src_suffix fr -tgt_suffix en 

Here is my -sample_dist, tested with or without the ‘.’ at the end of the name (after “fr-en”):

LM_MANAGEMENT.fr-en. 20
* 80

I always get this in the log, while I was expecting 20 as the distribution weight:

[09/11/17 14:38:32 INFO]  * file 'LM_MANAGEMENT.fr-en.' uniform weight: 0.6, distribution weight: 0.6

Is there something I didn’t understand properly?

:face_with_raised_eyebrow:

that sounds like the right expectation :slight_smile: - let me try to reproduce.

10 posts were split to a new topic: Out Of Memory with Dynamic Dataset

the problem is that LM_MANAGEMENT.fr-en. has to be a Lua pattern, so the - should be escaped as %-, i.e. LM_MANAGEMENT.fr%-en. - but concretely, you should simply use LM_MANAGEMENT.
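A quick way to check this outside OpenNMT (a plain Lua snippet with a hypothetical file name):

-- '-' is a (lazy) quantifier in Lua patterns, so the unescaped pattern does not match the literal dash:
print(("LM_MANAGEMENT.fr-en.src"):find("LM_MANAGEMENT.fr-en."))  -- nil
-- escaped as '%-', the dash is matched literally and the file name is found:
print(("LM_MANAGEMENT.fr-en.src"):find("LM_MANAGEMENT.fr%-en.")) -- 1  20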


It’s ok for this point:

[09/11/17 19:01:59 INFO]  * file 'LM_MANAGEMENT.fr-en.' uniform weight: 0.6, distribution weight: 20.0	

:slight_smile:

I’m using LuaJIT.

For reference - here is how to use Dynamic Dataset to build an LM using the billion-word LM dataset:

  • download the dataset from [here](http://www.statmt.org/lm-benchmark/)
  • build the vocabulary using tools/build_vocab.lua
  • and … train for 100 epochs with a 3,000,000-sentence sample, using the first held-out set as validation data:
th train.lua lm -train_dir ~/lm/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled -suffix ''\
                      -valid ~/lm/1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050\
                      -vocab ~/lm/vocab-00001-50k.dict\
                      -preprocess_pthreads 10 -sample 3000000\
                      -gpuid 1\
                      -save_every_epochs 5 -end_epoch 100\
                      -optim adam  -learning_rate 0.0002\
                      -reset_when_decay -learning_rate_decay 1 -start_decay_at 1\
                      -save_model ~/lm/blm1 

(last post deleted : I found the answer by myself)

I’m using -tok_src_case_feature true -tok_tgt_case_feature true, with my own dicts, in which all entries are lowercased. In the saved validation translations, all words with one or more uppercase letters come out as UNK. I was expecting that, with the case feature activated, all words would be looked up lowercased in the dicts.

I saw this in the validation translations, but what about the training sentences?

Is there a place where I can see which sentences are used for an epoch?

Is it possible to also see how ONMT is handling their words with the dicts?

Is there something I did wrong?