Before, we had a “huge” processed data file (tens of GB for a reasonable corpus), and it took many minutes to load this .t7 file before the beginning of training. So how does it work now?
Also, when using vocab importance sampling, i.e. many small epochs, does this add much overhead?
My benchmark (using 10 CPUs for preprocessing and individual files of about 2-3M segments) compared a 20M-segment preprocessed .t7 (~30 min loading time) vs. dynamic sampling (~3 min processing time at each epoch). For an epoch lasting ~4-6 hours, the overhead is negligible, and it lets you scale beyond 20M.
I can optimize a bit more (allocating multiple threads to a single file) if needed.
The problem is that LM_MANAGEMENT.fr-en. has to be a Lua pattern, so - should be escaped as %- (i.e. LM_MANAGEMENT.fr%-en.), but concretely you can simply use LM_MANAGEMENT.
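For reference, here is a quick check you can run in the Lua interpreter (the filename is just an illustration):

```lua
local name = "LM_MANAGEMENT.fr-en.tok"        -- hypothetical corpus filename
-- unescaped, '-' acts as a quantifier in Lua patterns, so the match fails:
print(name:match("LM_MANAGEMENT.fr-en."))     --> nil
-- '%-' matches a literal hyphen:
print(name:match("LM_MANAGEMENT.fr%-en."))    --> LM_MANAGEMENT.fr-en.
-- and the shorter prefix is enough anyway:
print(name:match("LM_MANAGEMENT"))            --> LM_MANAGEMENT
```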
(last post deleted: I found the answer myself)
I’m using -tok_src_case_feature true -tok_tgt_case_feature true with my own dicts, all entries lowercased. In the saved validation translations, all words with one or more uppercase letters appear as UNK. I was expecting that, with the case feature activated, all words would be looked up lowercased in the dicts.
I saw this in the validation translations, but what about the training sentences?
Is there a place where I can see which sentences are used for an epoch?
Is it also possible to see how ONMT is handling their words with the dicts?
To enable the on-the-fly tokenization, you do need to set -tok_(src|tgt)_mode conservative|aggressive, or it just doesn’t tokenize. This is probably what is missing here.
My sentences are already tokenized. It’s not a tokenization problem, only a case problem.
For example, this is a validation tgt sentence, as defined in the validation file: Express your vision again , speaking from your core and the center of your Circle of Success .
Here is how the validation was saved by ONMT: <unk> your vision again , speaking from your core and the center of your <unk> of <unk> .
But “success” is in the dicts. Here is another validation sentence with ‘success’: Secondly , the EFSF deal was met with considerable success , with demand for €45bn compared with an initial issue amount of €5bn .
And here is how ONMT saved it: <unk> , the <unk> deal was met with considerable success , with demand for €45bn compared with an initial issue amount of €5bn .
It was not really meant for that: the tokenization options are the ones from the tokenizer, and today you cannot use case_feature without tokenizing. I added a new tokenization mode, space, which does what you expect. So use -tok_(src|tgt)_mode space (the default is conservative).
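Conceptually, space mode with case_feature amounts to something like the sketch below (only an illustration, not the actual tokenizer code; the feature values and separator shown are assumptions): the line is split on spaces only, each token is lowercased for dictionary lookup, and a case feature is attached.

```lua
-- sketch only: split on whitespace, lowercase each token, attach a case feature
local function space_tokenize_with_case(line)
  local tokens = {}
  for word in line:gmatch("%S+") do
    local case
    if word == word:lower() then
      case = "l"                      -- lowercase (or no letters at all)
    elseif word == word:upper() then
      case = "U"                      -- all uppercase
    elseif word:sub(2) == word:sub(2):lower() then
      case = "C"                      -- capitalized
    else
      case = "M"                      -- mixed case
    end
    -- '|' stands in for the real feature separator
    table.insert(tokens, word:lower() .. "|" .. case)
  end
  return table.concat(tokens, " ")
end

print(space_tokenize_with_case("the center of your Circle of Success ."))
--> the|l center|l of|l your|l circle|C of|l success|C .|l
```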
OK… With this tokenization option, the preparation is far slower than previously… but everything seems to work properly. The first cycle (with the first epoch) completed properly, and the second one is starting. I didn’t notice anything else strange.
There is some overhead from setting case_feature, but it should not be that much slower.
Note that I cleaned up the code a bit: the sampling options are now called -gsample and -gsample_dist to differentiate them from the existing in-memory -sample option. Importance sampling (-sample_vocab) is also implemented with this mode.
I also added the id of the rule applied for each corpus (0 means no rule), so you can check that the correct rules are being applied.
Going to clean up and test a bit more, and will push to master soon.
The only strange thing, perhaps, is that when asking for -sample 2000000, the preparation step seems to explore 2M sentences, but 300K are rejected, so only 1.7M are kept. Perhaps it would be better to really produce 2M sentences: each time a sentence is rejected, it would be nice to draw another one as a replacement.
This one is trickier, unfortunately, because I need to draw the sample before checking the actual constraints on sentence length; otherwise it would be far less efficient (while keeping a good data distribution). I’ll keep that in mind, but it will require some other type of data structure.
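Roughly, the behaviour suggested above would look like the sketch below (not the current preprocessing code; 'segments' is assumed to be an in-memory array of tokenized segments). The catch is that it needs random access to all candidate segments, which is the kind of data structure change I mean.

```lua
-- sketch of the "redraw on rejection" idea, not the actual implementation:
-- keep drawing new indices until the requested sample size is reached,
-- replacing any draw that violates the length constraints.
local function sample_with_redraw(segments, n, max_len)
  local sampled, used, tried = {}, {}, 0
  while #sampled < n and tried < #segments do
    local idx = math.random(#segments)
    if not used[idx] then
      used[idx] = true
      tried = tried + 1
      local seg = segments[idx]
      -- the constraint is checked after the draw, so a rejected draw
      -- is simply replaced by another one on the next iteration
      if #seg > 0 and #seg <= max_len then
        table.insert(sampled, seg)
      end
    end
  end
  return sampled
end
```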
The timing is not right: the full preprocessing of the billion-word corpus (30M segments, from which I am sampling 3M) takes less than 4 minutes per epoch. I will do some additional benchmarks.