Before, we had a “huge” processed data file (tens of GB for a reasonable corpus), and it took many minutes to load this .t7 file before the beginning of training. So how does it work now?
Also, when using vocab importance sampling, i.e. many small epochs, does this add much overhead?
My benchmark (using 10 CPUs for preprocessing and individual files of about 2-3M segments) compared a 20M-segment preprocessed .t7 (~30 min loading time) vs. dynamic sampling (~3 min processing time at each epoch). For an epoch lasting ~4-6 hours, the overhead is negligible, and it lets you scale beyond 20M.
I can optimize a bit more (allocating multiple threads to a single file) if needed.
The problem is that LM_MANAGEMENT.fr-en. has to be a Lua pattern, so - should be escaped as %- (i.e. LM_MANAGEMENT.fr%-en.), but concretely you can simply use LM_MANAGEMENT.
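For reference, here is a quick check you can run in the Lua interpreter (the filename is just an illustration):

```lua
local name = "LM_MANAGEMENT.fr-en.tok"        -- hypothetical corpus filename
-- unescaped, '-' acts as a quantifier in Lua patterns, so the match fails:
print(name:match("LM_MANAGEMENT.fr-en."))     --> nil
-- '%-' matches a literal hyphen:
print(name:match("LM_MANAGEMENT.fr%-en."))    --> LM_MANAGEMENT.fr-en.
-- and the shorter prefix is enough anyway:
print(name:match("LM_MANAGEMENT"))            --> LM_MANAGEMENT
```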
(last post deleted: I found the answer myself)
I’m using -tok_src_case_feature true -tok_tgt_case_feature true with my own dicts, all entries lowercased. In the saved validation translations, all words with one or more uppercase letters appear as UNK. I was expecting that, with the case feature activated, all words would be looked up lowercased in the dicts.
I saw this in the validation translations, but what about the training sentences?
Is there a place where I can see which sentences are used for an epoch?
Is it also possible to see how ONMT is handling their words with the dicts?
To enable the on-the-fly tokenization, you do need to set -tok_(src|tgt)_mode conservative|aggressive, or it just doesn’t tokenize. This is probably what is missing here.
My sentences are already tokenized. It’s not a tokenization problem, only a case problem.
For example, this is a validation tgt sentence, as defined in the validation file: Express your vision again , speaking from your core and the center of your Circle of Success .
Here is how the validation was saved by ONMT: <unk> your vision again , speaking from your core and the center of your <unk> of <unk> .
But “success” is in the dicts. Here is another validation sentence with ‘success’: Secondly , the EFSF deal was met with considerable success , with demand for €45bn compared with an initial issue amount of €5bn .
And here is how ONMT saved it: <unk> , the <unk> deal was met with considerable success , with demand for €45bn compared with an initial issue amount of €5bn .
It was not really meant for that: the tokenization options are the ones from the tokenizer, and today you cannot use case_feature without tokenizing. I added a new tokenization mode, space, which does what you expect. So use -tok_(src|tgt)_mode space (the default is conservative).
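Conceptually, space mode with case_feature amounts to something like the sketch below (only an illustration, not the actual tokenizer code; the feature values and separator shown are assumptions): the line is split on spaces only, each token is lowercased for dictionary lookup, and a case feature is attached.

```lua
-- sketch only: split on whitespace, lowercase each token, attach a case feature
local function space_tokenize_with_case(line)
  local tokens = {}
  for word in line:gmatch("%S+") do
    local case
    if word == word:lower() then
      case = "l"                      -- lowercase (or no letters at all)
    elseif word == word:upper() then
      case = "U"                      -- all uppercase
    elseif word:sub(2) == word:sub(2):lower() then
      case = "C"                      -- capitalized
    else
      case = "M"                      -- mixed case
    end
    -- '|' stands in for the real feature separator
    table.insert(tokens, word:lower() .. "|" .. case)
  end
  return table.concat(tokens, " ")
end

print(space_tokenize_with_case("the center of your Circle of Success ."))
--> the|l center|l of|l your|l circle|C of|l success|C .|l
```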
OK… With this tokenization option, the preparation is far slower than previously… but everything seems to work properly. The first cycle (with the first epoch) completed properly, and the second one is starting. I didn’t notice anything else strange.
There is some overhead from setting case_feature, but it should not be that much slower.
Note that I cleaned up the code a bit: the sampling options are now called -gsample and -gsample_dist to differentiate them from the existing in-memory -sample option. Importance sampling (-sample_vocab) is also implemented with this mode.
I also added the id of the rule applied for each corpus (0 means no rule), so you can check that the correct rules are being applied.
Going to clean up and test a bit more, and will push to master soon.
The only strange thing, perhaps, is that when asking for -sample 2000000, the preparation step seems to explore 2M sentences, but 300K are rejected, so only 1.7M are kept. Perhaps it would be better to really produce 2M sentences: each time a sentence is rejected, it would be nice to draw another one as a replacement.
This one is trickier, unfortunately, because I need to draw the sample before checking the actual constraints on sentence length; otherwise it would be far less efficient (while keeping a good data distribution). I’ll keep that in mind, but it will require some other type of data structure.
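Roughly, the behaviour suggested above would look like the sketch below (not the current preprocessing code; 'segments' is assumed to be an in-memory array of tokenized segments). The catch is that it needs random access to all candidate segments, which is the kind of data structure change I mean.

```lua
-- sketch of the "redraw on rejection" idea, not the actual implementation:
-- keep drawing new indices until the requested sample size is reached,
-- replacing any draw that violates the length constraints.
local function sample_with_redraw(segments, n, max_len)
  local sampled, used, tried = {}, {}, 0
  while #sampled < n and tried < #segments do
    local idx = math.random(#segments)
    if not used[idx] then
      used[idx] = true
      tried = tried + 1
      local seg = segments[idx]
      -- the constraint is checked after the draw, so a rejected draw
      -- is simply replaced by another one on the next iteration
      if #seg > 0 and #seg <= max_len then
        table.insert(sampled, seg)
      end
    end
  end
  return sampled
end
```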
The timing is not right: the full preprocessing of the billion-word corpus (30M segments, from which I am sampling 3M) takes less than 4 minutes per epoch. I will do some additional benchmarks.