to enable the on-the-fly tokenization, you do need to set -tok_(src|tgt)_mode conservative|aggressive, otherwise it simply doesn't tokenize. This is probably what is missing here.
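For example, something along these lines (just a sketch - the paths, vocab files and model name are placeholders, keep your other options as they are):
$ th train.lua -train_dir mydata -src_suffix .en -tgt_suffix .fr -valid_src mydata/valid.en -valid_tgt mydata/valid.fr -src_vocab mydata/src.dict -tgt_vocab mydata/tgt.dict -tok_src_mode aggressive -tok_tgt_mode aggressive -save_model mymodel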
I am not sure I understand.
My sentences are already tokenized. It's not a tokenisation problem, only a case problem.
For example, this is a validation tgt sentence, as defined in the validation file:
Express your vision again , speaking from your core and the center of your Circle of Success .
Here is how the validation was saved by ONMT:
<unk> your vision again , speaking from your core and the center of your <unk> of <unk> .
But “success” is in the dictionaries. Here is another validation sentence with “success”:
Secondly , the EFSF deal was met with considerable success , with demand for €45bn compared with an initial issue amount of €5bn .
And here is how ONMT saved it:
<unk> , the <unk> deal was met with considerable success , with demand for €45bn compared with an initial issue amount of €5bn .
it was not really meant for that - the tokenization options are the ones from the tokenizer, and you currently cannot use case_feature without tokenizing. I added a new tokenization mode, space, which does what you expect. So use -tok_(src|tgt)_mode space (the default is conservative).
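Concretely, something like this should work (a sketch with placeholder paths; I am writing -tok_tgt_case_feature by symmetry with the source-side flag):
$ th train.lua -train_dir mydata -src_suffix .en -tgt_suffix .fr -valid_src mydata/valid.en -valid_tgt mydata/valid.fr -src_vocab mydata/src.dict -tgt_vocab mydata/tgt.dict -tok_src_mode space -tok_src_case_feature -tok_tgt_mode space -tok_tgt_case_feature -save_model mymodel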
OK… With this tokenisation option, the preparation time is far slower than before… but everything seems to work properly. The first cycle (with the first epoch) completed correctly and the second one is starting. I didn't notice anything else strange.
Thanks !
there is some overhead with the case_feature setting, but it should not be that much slower.
Note that I cleaned up the code a bit - the sampling options are now called -gsample and -gsample_dist to differentiate them from the existing in-memory -sample option. Importance sampling (-sample_vocab) is also implemented with this mode.
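As an illustration (a sketch with placeholder paths; I am assuming -gsample_dist points to a file of per-corpus distribution rules, as covered in the documentation):
$ th train.lua -train_dir mydata -src_suffix .en -tgt_suffix .fr -valid_src mydata/valid.en -valid_tgt mydata/valid.fr -src_vocab mydata/src.dict -tgt_vocab mydata/tgt.dict -gsample 2000000 -gsample_dist mydata/rules.txt -save_model mymodel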
I also added logging of the applied rule id for each corpus (0 means no rule), so you can check that the correct rules are being applied.
I am going to clean up and test a bit more, and will push to master soon.
The only strange thing, perhaps, is that when asking for -sample 2000000, the preparation step seems to draw 2M sentences, but about 300K are rejected, so only 1.7M are kept. Perhaps it would be better to really produce 2M sentences: each time a sentence is rejected, it would be nice to draw another one as a replacement.
this one is unfortunately trickier because I need to draw the sample before checking the actual constraints on sentence length; otherwise it would be far less efficient (to keep a good data distribution). I will keep that in mind, but it will require a different type of data structure.
I didn't keep the old log, but I think the whole preparation time was perhaps a few minutes.
Now, with both the 'space' tokenisation option and the case_feature, it takes 40 minutes for the MultiUN file alone:
[09/12/17 16:23:53 INFO] * [-] file 'DGT.fr-en.': 1987655 total, 191344 drawn, 163200 kept - unknown words: source = 10.1%, target = 6.1%
[09/12/17 16:23:56 INFO] * [-] file 'OpenOffice.fr-en.': 31902 total, 3072 drawn, 3050 kept - unknown words: source = 15.9%, target = 5.7%
[09/12/17 17:03:01 INFO] * [-] file 'MultiUN.fr-en.': 10480212 total, 1008886 drawn, 802531 kept - unknown words: source = 5.5%, target = 3.0%
[09/12/17 17:03:19 INFO] * [-] file 'LM_TRANSPORT.fr-en.': 211199 total, 20332 drawn, 20249 kept - unknown words: source = 15.8%, target = 7.4%
this time is not right - the full preprocessing of the billion word corpus (30M segments from which I am sampling 3M) takes less than 4 minutes per epoch. I will do some additional benchmarks.
Could this be due to the -preprocess_pthreads 1 option discussed above, compared to your own config?
no - because on a single document, it is currently single-threaded.
@Etienne38, I just committed some optimizations to the tokenizer - which give a 50% time reduction - and also fixed the preprocessing: it was tokenizing all sentences, not only the ones in the sample.
Can you measure the following using this corpus [https://s3.amazonaws.com/opennmt-trainingdata/baseline-1M-enfr.tgz]:
$ th tools/tokenize.lua -mode space -case_feature < baseline-1M-enfr/baseline-1M_train.fr > /dev/null
Tokenization completed in 50.459 seconds - 1009163 sentences
$ th tools/tokenize.lua -mode aggressive -case_feature < baseline-1M-enfr/baseline-1M_train.fr > /dev/null
Tokenization completed in 128.618 seconds - 1009163 sentences
$ th train.lua -train_dir baseline-1M-enfr -save_model t -valid_src baseline-1M-enfr/baseline-1M_valid.en -valid_tgt baseline-1M-enfr/baseline-1M_valid.fr -src_vocab data/demo.src.dict -tgt_vocab data/demo.tgt.dict -tok_src_mode aggressive -tok_src_case_feature -tok_tgt_mode aggressive -gsample 100000 -src_suffix .en -tgt_suffix .fr
[...]
[09/12/17 22:52:12 INFO] * [1] file 'baseline-1M_test': 1000 total, 99 drawn, 97 kept - unknown words: source = 5.7%, target = 38.5%
[09/12/17 22:52:13 INFO] * [1] file 'baseline-1M_valid': 1000 total, 99 drawn, 97 kept - unknown words: source = 4.2%, target = 39.3%
[09/12/17 22:52:46 INFO] * [2] file 'baseline-1M_train': 1009163 total, 99803 drawn, 98062 kept - unknown words: source = 5.1%, target
[CTRL-C]
You should get about the same results - and the time on 1 thread for MultiUN should be < 4 min.
Finally, even with LuaJIT - you can try at least 2 threads.
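For example, the same command as above with two preprocessing threads:
$ th train.lua -train_dir baseline-1M-enfr -save_model t -valid_src baseline-1M-enfr/baseline-1M_valid.en -valid_tgt baseline-1M-enfr/baseline-1M_valid.fr -src_vocab data/demo.src.dict -tgt_vocab data/demo.tgt.dict -tok_src_mode aggressive -tok_src_case_feature -tok_tgt_mode aggressive -gsample 100000 -src_suffix .en -tgt_suffix .fr -preprocess_pthreads 2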
I just tried to restart my current training from the last checkpoint (epoch 5), with the new code.
With 2 threads, I still got the OutOfMemory error.
The preparation time, with only one thread, was a bit less than 5 minutes for all files (16.7M).
I don't see anything else noticeable. All seems OK.
Thanks.
Before stopping the training, my card was loaded at 10G/11G. After restarting from the checkpoint, it is only loaded at 7G/11G.
I suppose the difference comes from the validation translation step, right?
Is it possible there is a memory leak?
Unlikely - it is probably related to Lua/Torch memory management, which we don't fully control. But it should not have any functional impact.
I didn't watch it at every moment to be sure, but I think the memory was going up at validation translation time, before the end of the first epoch. To be confirmed.
Hi all, the feature is now live on master. Thanks a lot to @Etienne38 for all the testing. Documentation is here: http://opennmt.net/OpenNMT/training/sampling/#file-sampling
Is it possible to use the feature without sampling, or rather with real over-sampling?
I had the idea of a 70/20/10 weighting scheme on data where the 70% bin had a lot less data than the 20% bin, and a little more than the 10% bin. I'd like to use all of the data, since altogether it's only ~1.6M segments, but I really like the weighting idea. (I could manually duplicate the 70% bin's data to achieve this, but it would be cool to do it with the new feature if possible.)
If your 70/20/10 split doesn't exactly match a multiple of your data quantities, I think you are obliged to build the final set in a somewhat random way. So this new DynData should meet your need by, for example, building a 2M or 3M set over your original 1.6M set, using 70/20/10 in the gsample_dist parameters.
Yes, exactly, it should do what you expect. If you ask for 2M with 70% drawn from a sub-corpus that holds less than a uniform 20% share of the data, then you will get an average (and random) oversampling of about 3.5 on that part.
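For the 70/20/10 case above, it could look something like this (a sketch: 'tech', 'news' and 'legal' are hypothetical corpus name patterns, and I am assuming the -gsample_dist file takes one 'pattern weight' rule per line as described in the documentation linked above):
$ cat dist_rules.txt
tech 70
news 20
legal 10
$ th train.lua -train_dir mydata -src_suffix .en -tgt_suffix .fr -valid_src mydata/valid.en -valid_tgt mydata/valid.fr -src_vocab mydata/src.dict -tgt_vocab mydata/tgt.dict -gsample 2000000 -gsample_dist dist_rules.txt -save_model mymodel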