It would be nice to have the option of building vocabs on the fly in directory mode for both preprocess.lua and train.lua.
The current workaround is to concatenate all of the files in the train_dir into new source & target files (extra disk usage), run build_vocab on the big files, and then remove the big files once they're no longer needed.
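To illustrate the idea, here is a rough sketch (in Python, not the toolkit's Lua) of counting tokens file by file so that no concatenated copy is ever written to disk. The function name `build_vocab_from_dir`, the `suffix` and `max_size` parameters, and the whitespace tokenization are all assumptions made for the example; this is not how build_vocab is actually implemented.

```python
import os
from collections import Counter


def build_vocab_from_dir(train_dir, suffix, max_size=50000):
    """Hypothetical helper: collect the most frequent tokens across every
    file in train_dir whose name ends with `suffix`, one line at a time."""
    counts = Counter()
    for name in sorted(os.listdir(train_dir)):
        if not name.endswith(suffix):
            continue  # skip files for the other side (or unrelated files)
        path = os.path.join(train_dir, name)
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Assumption: tokens are separated by whitespace.
                counts.update(line.split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:max_size]]
```

Calling it once per side, for example with `".src"` and `".tgt"` as the suffixes, would give the two vocabularies. Only the token counts are held in memory, so the extra disk usage of the workaround goes away.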