The idea is to completely remove the preprocessing stage, to enable working on very large datasets.
The new features are:
in preprocess.lua - you can now use the -train_dir PATH option to point to a directory with multiple training files (suffixes default to .src/.tgt for bitext but can be defined with -src_suffix SRCSUF/-tgt_suffix TGTSUF): there is no need to aggregate files (they still need to be tokenized - a further step might also be to integrate on-the-fly tokenization)
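As a sketch, a run over a directory of training files might look like the following (paths and the -save_data prefix are placeholders; -valid_src/-valid_tgt/-save_data are the usual preprocess.lua options):

th preprocess.lua -train_dir data/train -src_suffix .src -tgt_suffix .tgt \
  -valid_src data/valid.src -valid_tgt data/valid.tgt -save_data data/demo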
all the options of preprocess.lua can be used in train.lua - this has a triple goal:
** you don’t need to preprocess your data anymore
** this combines with the -sample X option, which allows you to work on very large datasets (I tried it on 60M sentences) and sample at each epoch a subset of the full dataset without any memory problem. X can either be a number of sentences or, if a float in [0,1], the ratio of the complete dataset to sample at each epoch - see the example command after this list
** you can dynamically influence the sampling by providing a weight for each corpus you have in your -train_dir directory - see the post below. In particular, you can over-sample some in-domain corpus while still keeping a good mix with out-of-domain data
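As a sketch (file names and the model path are illustrative; -save_model and -gpuid are the usual train.lua options), training directly on the raw files could look like:

th train.lua -train_dir data/train -src_suffix .src -tgt_suffix .tgt \
  -valid_src data/valid.src -valid_tgt data/valid.tgt \
  -sample 0.2 -save_model models/demo -gpuid 1

Here -sample 0.2 draws 20% of the full collection at each epoch, while -sample 2000000 would draw 2M sentences instead.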
I am still testing, but any feedback/test results/additional ideas are welcome!
if you want to have, say, 20% of IT (IT*+MSDN), 10% of colloquial, and 65% of generic - use the following rule file, which you specify with the -sample_dist FILE option:
IT,MSDN 20
colloquial 10
generic 65
* 5
each rule is a collection of Lua patterns separated by commas; * matches all files.
the numbers are normalized so that they sum to 1
here, 20% will come from IT1, IT2 and MSDN together. In combination with the -sample N parameter, this even enables over-sampling of under-represented categories.
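To make the normalization concrete: the weights above sum to 20 + 10 + 65 + 5 = 100, so they become 0.20, 0.10, 0.65 and 0.05. With -sample 1000000, roughly 200k sentences would come from the IT*/MSDN files, 100k from colloquial, 650k from generic and 50k from everything else. A possible invocation (file names are illustrative) would be:

th train.lua -train_dir data/train -sample 1000000 -sample_dist rules.txt \
  -save_model models/demo -gpuid 1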
the final code is there - I have introduced parallel processing of the data files in a train_dir, so the usage and benefits are the following:
normal process using preprocess.lua
– it will be far faster if you use train_dir than if you concatenate the different files manually
– you can use sampling and sample_dist to automatically build your training data with a specific distribution out of a collection
new process - no need to use preprocess.lua at all, which saves a lot of time, and with sampling you have far better control over the training data distribution
Interesting.
One other thing that could help is to artificially modify the distribution of sentence lengths.
It is obvious that very short (<10) and very long (>40) segments are under-represented.
Maybe modifying this Gaussian could help to better translate these instances.
On-the-fly tokenization is now also introduced. The training data does not need any type of preparation: all tokenization options are available for source and target with the tok_src_ and tok_tgt_ prefixes.
This is done using multiple threads, so it remains fast for large collections of documents.
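As a minimal sketch, assuming the tokenizer options simply take the tok_src_/tok_tgt_ prefixes (only the case_feature variants are explicitly mentioned in this thread), a run on raw text could look like:

th train.lua -train_dir data/train \
  -tok_src_case_feature true -tok_tgt_case_feature true \
  -sample 0.2 -save_model models/demo -gpuid 1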
@jean.senellart,
Is it dynamic in the sense that each dataset generates one .t7 file and it then samples/balances from each .t7 file dynamically, OR do you generate only one single .t7 file based on the sampling/balance you decide upfront?
IMO it is more convenient to have several .t7 files already generated and to balance at training time for each run.
@vince62s - there is no .t7 file generated at all - all the computation is done on the non-tokenized, unprocessed corpus - which is actually generally faster. The weighted sampling is done at each epoch.
before, we had a “huge” processed data file (tens of GB when dealing with a reasonable corpus) and it took many minutes to load this .t7 file before the beginning of the training, so how does it work now?
also, when using vocab importance sampling, i.e. many small epochs, does this add much overhead?
my benchmark (using 10 CPUs for preprocessing and individual files of about 2-3M segments) compared a 20M-sentence preprocessed .t7 (~30 min loading time) vs. dynamic sampling (~3 min processing time at each epoch). For an epoch lasting ~4-6 hours, this makes the overhead negligible and allows you to scale to more than 20M.
I can optimize a bit more (allocating multiple threads for a single file) if needed
the problem is that LM_MANAGEMENT.fr-en. has to be a Lua pattern, so the - should be escaped as %-, giving LM_MANAGEMENT.fr%-en. - but concretely you should simply use LM_MANAGEMENT.
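A quick way to check this in the th interpreter (the unescaped - acts as a quantifier in Lua patterns, so the first pattern does not match the file name):

print(string.find("LM_MANAGEMENT.fr-en.src", "LM_MANAGEMENT.fr-en."))   -- nil
print(string.find("LM_MANAGEMENT.fr-en.src", "LM_MANAGEMENT.fr%-en."))  -- 1  20
print(string.find("LM_MANAGEMENT.fr-en.src", "LM_MANAGEMENT"))          -- 1  13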
(last post deleted: I found the answer by myself)
I’m using -tok_src_case_feature true -tok_tgt_case_feature true with my own dicts, with all entries lowercased. In the saved validation translations, all words with one or more uppercase letters appear as UNK. I was expecting that, with the case feature activated, all words would be looked up lowercased in the dicts.
I saw this in the validation translation, but what about the training sentences?
Is there a place where I can see which sentences are used for an epoch?
Is it possible to also see how ONMT is handling their words with the dicts?