Pre-processing steps to split?

(Vincent Nguyen) #1

Today, vocab building and data preparation are embedded in the same step.

In some scenarios (for example, when we have an out-of-domain and an in-domain data set), it would be better to build the dictionary from both data sets and then run the data preparation on each one individually.

This can be done, of course, but it takes extra manipulation with the current scripts or extra coding at a lower level.

Maybe we just need to split the dictionary build from the actual data preparation.
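
To make the idea concrete, here is a rough sketch of the two-step flow in plain Python. It is not the toolkit's code: the function names (build_vocab, prepare), the special tokens, and the toy sentences are all hypothetical and only illustrate building one dictionary from both data sets, then preparing each data set separately against it.

```python
from collections import Counter

# Hypothetical special tokens; a real toolkit defines its own set.
SPECIALS = ["<blank>", "<unk>", "<s>", "</s>"]


def build_vocab(corpora, max_size=50000):
    """Step 1: count tokens over ALL corpora and return a token -> index map."""
    counts = Counter()
    for corpus in corpora:
        for line in corpus:
            counts.update(line.split())
    # Most frequent first; ties broken alphabetically so the result is stable.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Skip tokens that collide with the reserved special tokens.
    kept = [tok for tok, _ in ordered if tok not in SPECIALS]
    kept = kept[: max_size - len(SPECIALS)]
    return {tok: idx for idx, tok in enumerate(SPECIALS + kept)}


def prepare(corpus, vocab):
    """Step 2: convert one corpus to index sequences using a shared vocab."""
    unk = vocab["<unk>"]
    return [[vocab.get(tok, unk) for tok in line.split()] for line in corpus]


# Toy stand-ins for the two data sets (assumed already tokenized).
out_of_domain = ["the cat sat on the mat", "the dog barked"]
in_domain = ["the patient sat down", "the dose was reduced"]

# Build the dictionary once, from both data sets...
vocab = build_vocab([out_of_domain, in_domain])

# ...then prepare each data set individually with that same dictionary.
out_of_domain_ids = prepare(out_of_domain, vocab)
in_domain_ids = prepare(in_domain, vocab)
```

With the steps separated like this, both data sets share the same indices, so a model trained on one can be continued on the other without remapping the vocabulary.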


(Guillaume Klein) #2

Yes, we could easily provide a new script that just builds the dictionary. Adding it to my list.