Dynamic Dataset

@vince62s - there is no .t7 file generated at all - all the computation is done on the non-tokenized, unprocessed corpus - which is actually generally faster. The weighted sampling is done at each epoch.
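
For illustration, here is a minimal sketch of the per-epoch idea (this is not the actual OpenNMT code; the file names, weights and the reservoir-sampling shortcut are just an assumption of how such a sampler could look):

local files = {
  { path = 'corpusA.fr-en.', weight = 0.2 },  -- hypothetical file names and weights
  { path = 'corpusB.fr-en.', weight = 0.8 },
}
local sample_size = 1000000

-- read a raw text file once and keep a random subset of the requested size
-- (reservoir sampling keeps memory bounded to the sample itself)
local function sample_file(path, count)
  local kept, seen = {}, 0
  for line in io.lines(path) do
    seen = seen + 1
    if #kept < count then
      kept[#kept + 1] = line
    elseif math.random(seen) <= count then
      kept[math.random(count)] = line
    end
  end
  return kept
end

-- at each epoch: draw from every file in proportion to its weight
for _, f in ipairs(files) do
  local count = math.floor(sample_size * f.weight + 0.5)
  f.sentences = sample_file(f.path, count)
end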

hmm okay, just to be clear:

Before, we had a “huge” preprocessed data file (tens of GB when dealing with a reasonable corpus) and it took many minutes to load this .t7 file before the beginning of the training. So how does it work now?

Also, when using vocab importance sampling, i.e. many small epochs, does this add much overhead?

My benchmark (using 10 CPUs for preprocessing and individual files of about 2-3M segments) compared a 20M-segment preprocessed .t7 (~30 min loading time) against dynamic sampling (~3 min processing time at each epoch). For an epoch lasting ~4-6 hours, the overhead is negligible and it lets you scale beyond 20M.
I can optimize a bit more (allocating multiple threads to a single file) if needed.


I’m trying to experiment with this.

I’m using:

 -src_suffix fr -tgt_suffix en 

Here is my -sample_dist, tested with or without the ‘.’ at the end of the name (after “fr-en”):

LM_MANAGEMENT.fr-en. 20
* 80

I always get this in the log, while I was expecting a distribution weight of 20:

[09/11/17 14:38:32 INFO]  * file 'LM_MANAGEMENT.fr-en.' uniform weight: 0.6, distribution weight: 0.6

Is there something I didn’t understand properly ?

:face_with_raised_eyebrow:

It sounds like the right expectation :slight_smile: - let me try to reproduce.

10 posts were split to a new topic: Out Of Memory with Dynamic Dataset

The problem is that LM_MANAGEMENT.fr-en. has to be a Lua pattern, so the - should be escaped as %-, i.e. LM_MANAGEMENT.fr%-en. - but concretely you should simply use LM_MANAGEMENT.
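
For example, in plain Lua (this is just the pattern behaviour the matching relies on):

local name = 'LM_MANAGEMENT.fr-en.'
print(string.match(name, 'LM_MANAGEMENT.fr-en.'))   -- nil: '-' is a quantifier, so the dash breaks the match
print(string.match(name, 'LM_MANAGEMENT.fr%-en.'))  -- matches: the dash is escaped
print(string.match(name, 'LM_MANAGEMENT'))          -- simplest: match on the prefix only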


It’s ok for this point:

[09/11/17 19:01:59 INFO]  * file 'LM_MANAGEMENT.fr-en.' uniform weight: 0.6, distribution weight: 20.0	

:slight_smile:

I’m using LuaJIT.

For reference - here is how to use Dynamic Dataset to build an LM using the billion word LM dataset:

  • download the dataset from [here](http://www.statmt.org/lm-benchmark/)
  • build the vocabulary using tools/build_vocab.lua
  • and … train for 100 epochs with samples of 3000000 sentences, using the first held-out set as validation data:
th train.lua lm -train_dir ~/lm/1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled -suffix ''\
                      -valid ~/lm/1-billion-word-language-modeling-benchmark-r13output/heldout-monolingual.tokenized.shuffled/news.en.heldout-00000-of-00050\
                      -vocab ~/lm/vocab-00001-50k.dict\
                      -preprocess_pthreads 10 -sample 3000000\
                      -gpuid 1\
                      -save_every_epochs 5 -end_epoch 100\
                      -optim adam  -learning_rate 0.0002\
                      -reset_when_decay -learning_rate_decay 1 -start_decay_at 1\
                      -save_model ~/lm/blm1 

(last post deleted: I found the answer myself)

I’m using -tok_src_case_feature true -tok_tgt_case_feature true, with my own dicts, in which all entries are lowercased. In the saved validation translations, all words with one or more uppercase letters come out as <unk>. I was expecting that, with the case feature activated, all words would be looked up lowercased in the dicts.

I saw this in the validation translation, but what about the training sentences?

Is there a place where I can see which sentences are used for an epoch?

Is it possible to also see how ONMT is handling their words with the dicts?

Is there something I did wrong?

To enable the on-the-fly tokenization, you do need to set -tok_(src|tgt)_mode conservative|aggressive, otherwise it just doesn’t tokenize. This is probably what is missing here.

Not sure I understand.

My sentences are already tokenized. It’s not a tokenization problem, only a case problem.

For example, this is a validation tgt sentence, as defined in the validation file:
Express your vision again , speaking from your core and the center of your Circle of Success .
Here is how the validation was saved by ONMT:
<unk> your vision again , speaking from your core and the center of your <unk> of <unk> .

But “success” is in the dicts. Here is another validation sentence with ‘success’:
Secondly , the EFSF deal was met with considerable success , with demand for €45bn compared with an initial issue amount of €5bn .
And here is how ONMT saved it:
<unk> , the <unk> deal was met with considerable success , with demand for €45bn compared with an initial issue amount of €5bn .

It was not really meant for that - the tokenization options are the ones from the tokenizer, and today you cannot use case_feature without tokenizing. I added a new tokenization mode, space, which does what you expect. So use -tok_(src|tgt)_mode space (the default is conservative).
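
So, for already-tokenized input with lowercased dicts, the options would look something like this (just restating the flags discussed above; exact defaults may differ):

 -tok_src_mode space -tok_src_case_feature true -tok_tgt_mode space -tok_tgt_case_feature true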


OK… With this tokenization option, the preparation step is far slower than previously… but everything seems to work properly. The first cycle (with the first epoch) completed properly, and the second one is starting. I didn’t notice anything else strange.

Thanks !
:slight_smile:

There is some overhead when the case_feature is set, but it should not be that much slower.

Note that I cleaned up the code a bit - the sampling options are now called -gsample and -gsample_dist to differentiate them from the existing in-memory -sample option. Importance sampling (-sample_vocab) is also implemented with this mode.

I also added the id of the rule applying to each corpus (0 means no rule) - so you can check that the correct rules are being applied.

Going to clean up and test a bit more, and will push to master soon.
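
As a rough example of how the renamed options would be passed (the paths are placeholders, and I am assuming -gsample takes the sample size the way -sample did):

th train.lua -train_dir data/ -src_suffix fr -tgt_suffix en\
             -gsample 2000000 -gsample_dist distribution.txt ...

where distribution.txt holds one rule per line, e.g. (following the example earlier in the thread):

LM_MANAGEMENT 20
* 80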

The only strange thing, perhaps, is that when asking for -sample 2000000, the preparation step seems to explore 2M sentences, but 300K are rejected, so only 1.7M are kept. Perhaps it would be better to really produce 2M sentences: each time a sentence is rejected, it would be nice to draw another one as a replacement.

This one is unfortunately trickier because I need to draw the sample before checking the actual constraints on sentence length, otherwise it would be far less efficient (to keep a good data distribution). I’ll keep that in mind, but it will require some other type of data structure.
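
In other words, the sample is drawn first and the length/validity constraints are only applied afterwards, which is why "kept" can end up lower than "drawn". A rough sketch of that order of operations (not the actual code):

-- draw first, filter afterwards: 'kept' can end up smaller than 'drawn'
local function draw_then_filter(lines, drawn_count, max_length)
  local drawn, kept = {}, {}
  for i = 1, drawn_count do
    drawn[i] = lines[math.random(#lines)]      -- weighted draw in the real code
  end
  for _, sent in ipairs(drawn) do
    local n = select(2, sent:gsub('%S+', ''))  -- crude token count
    if n > 0 and n <= max_length then          -- constraints checked only after drawing
      kept[#kept + 1] = sent
    end
  end
  return kept
end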

I didn’t keep the old log, but I think the whole preparation time was perhaps a few minutes.

Now, with both the tokenization mode “space” and the case_feature, it takes 40 min for the MultiUN file alone:

[09/12/17 16:23:53 INFO]  * [-] file 'DGT.fr-en.': 1987655 total, 191344 drawn, 163200 kept - unknown words: source = 10.1%, target = 6.1%
[09/12/17 16:23:56 INFO]  * [-] file 'OpenOffice.fr-en.': 31902 total, 3072 drawn, 3050 kept - unknown words: source = 15.9%, target = 5.7%
[09/12/17 17:03:01 INFO]  * [-] file 'MultiUN.fr-en.': 10480212 total, 1008886 drawn, 802531 kept - unknown words: source = 5.5%, target = 3.0%
[09/12/17 17:03:19 INFO]  * [-] file 'LM_TRANSPORT.fr-en.': 211199 total, 20332 drawn, 20249 kept - unknown words: source = 15.8%, target = 7.4%

The timing is not right - the full preprocessing of the billion word corpus (30M segments from which I sample 3M) takes less than 4 minutes per epoch. I will do some additional benchmarks.