I just added a new feature to OpenNMT: the idea is to completely remove the preprocessing stage, to enable working on very large datasets.
The new features are:
- `preprocess.lua`: you can now use the `-train_dir PATH` option to point to a directory with multiple training files (the suffix is `.tgt` by default for bitext but can be defined with `-tgt_suffix TGTSUF`): no need to aggregate files (they still need to be tokenized - a further step might be to integrate on-the-fly tokenization)
- all the options of `preprocess.lua` can be used in `train.lua` - this serves a triple goal:
** you don’t need to preprocess your data anymore
** this combines with the `-sample X` option, which allows you to work on very large datasets (I tried it on 60M sentences) and sample at each epoch a subset of the full dataset without any memory problem. `X` can either be a number of sentences or, if a float in `[0,1]`, a ratio of the complete dataset to sample at each iteration
** you can dynamically influence the sampling by providing a weight for each corpus you have in your `-train_dir` directory - see the post below. In particular, you can over-sample some in-domain corpus while still keeping a good mix with out-of-domain data
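To make the `-train_dir` behavior concrete, here is a minimal Python sketch of suffix-based file discovery (the function name is hypothetical; OpenNMT's actual Lua implementation may differ):

```python
import os

def find_training_files(train_dir, tgt_suffix=".tgt"):
    """Collect all files in train_dir ending with tgt_suffix,
    so separate corpora never need to be aggregated by hand."""
    return sorted(
        os.path.join(train_dir, name)
        for name in os.listdir(train_dir)
        if name.endswith(tgt_suffix)
    )
```

With a custom `-tgt_suffix`, you would simply pass a different `tgt_suffix` value.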
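The `-sample X` semantics described above can be sketched as follows (hypothetical Python helpers, not the actual implementation): a float in `[0,1]` is treated as a ratio of the full dataset, any other number as an absolute sentence count, and a fresh subset is drawn for each epoch:

```python
import random

def sample_size(x, dataset_size):
    """Interpret the sampling parameter X: a float in [0, 1] is a
    ratio of the full dataset; otherwise an absolute sentence count."""
    if isinstance(x, float) and 0.0 <= x <= 1.0:
        return int(x * dataset_size)
    return int(x)

def sample_epoch(dataset, x, rng=random):
    """Draw a fresh random subset of the dataset for one epoch."""
    return rng.sample(dataset, sample_size(x, len(dataset)))
```

Because only the sampled subset is materialized per epoch, the full 60M-sentence corpus never has to fit in memory at once.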
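The per-corpus weighting could look like this in spirit (an illustrative Python sketch; the real option names and allocation scheme in OpenNMT may differ): each corpus contributes a share of the epoch budget proportional to its weight, which is how an in-domain corpus gets over-sampled while out-of-domain data stays in the mix:

```python
import random

def weighted_corpus_sample(corpora, weights, n, rng=random):
    """Sample roughly n sentences across corpora, allocating the
    budget in proportion to each corpus weight."""
    total = sum(weights.values())
    sampled = []
    for name, sentences in corpora.items():
        k = min(round(n * weights[name] / total), len(sentences))
        sampled.extend(rng.sample(sentences, k))
    return sampled
```

For example, weighting an in-domain corpus 3:1 against a generic corpus yields about three in-domain sentences for every generic one in each epoch's sample.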
I am still testing, but any feedback/test results/additional ideas are welcome!