Need help preprocessing my data corpus

I have 8-way parallel data and want to train my seq2seq model on it. However, when I tried to preprocess the data, it basically said I needed to do it manually and that I cannot use -train_dir until it's preprocessed. Can I get some help?

It specifically gives me the error:
“For directory mode, vocabs should be predefined.”


Did you try to follow the quickstart guide first?

Yeah, I read through it before I posted here. Sorry, I’m new to this. So do I just need to preprocess each of my data files individually, and then I can set up my -train_dir? I was hoping there was a ready-made way to preprocess a whole folder of data instead of one src and one tgt file.

-train_dir is an advanced option that I would not recommend for getting started.

To train your model, you just need to prepare the 4 files listed in the quickstart:

  • src-train.txt
  • tgt-train.txt
  • src-val.txt
  • tgt-val.txt

If you have multiple training files, simply concatenate them.
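For what it's worth, here is a rough Python sketch of that concatenation step. It is only an illustration and not part of the toolkit: the function name and the file names in the usage note are made up, and it assumes every source file has a target file with the same number of lines, listed in the same order. It reads each file fully into memory, so it suits moderate file sizes.

```python
def read_lines(path):
    """Return the lines of a text file without their trailing newline."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def concatenate_parallel(src_paths, tgt_paths, src_out, tgt_out):
    """Concatenate paired source/target files, keeping the pairs aligned.

    Hypothetical helper: src_paths[i] must be parallel to tgt_paths[i].
    Returns the total number of sentence pairs written.
    """
    if len(src_paths) != len(tgt_paths):
        raise ValueError("need exactly one target file per source file")
    total = 0
    with open(src_out, "w", encoding="utf-8") as s_out, \
            open(tgt_out, "w", encoding="utf-8") as t_out:
        for src, tgt in zip(src_paths, tgt_paths):
            src_lines = read_lines(src)
            tgt_lines = read_lines(tgt)
            # Refuse to merge a pair that is not line-aligned.
            if len(src_lines) != len(tgt_lines):
                raise ValueError(f"{src} and {tgt} differ in line count")
            s_out.writelines(line + "\n" for line in src_lines)
            t_out.writelines(line + "\n" for line in tgt_lines)
            total += len(src_lines)
    return total
```

You would call it with something like concatenate_parallel(["a.src", "b.src"], ["a.tgt", "b.tgt"], "src-train.txt", "tgt-train.txt"), where the input names are placeholders for your own files.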

I’d rather be able to experiment around with the advanced settings than never use them at all.

Could you share a correct way to use -train_dir, please?

See this documentation if you are interested:

Yeah, the dynamic dataset block is what I was looking at that showed me the -train_dir option. But like I said, after trying that option it gave me “For directory mode, vocabs should be predefined.”

To clarify: if I use -train_dir and point it at the correct folder, and everything works, will it by default use all the data files inside that folder for training (as long as their line counts match), whether there is 1 file or a bunch of them?

Yes, you should first build a vocabulary, as the training data will be streamed. See the script tools/build_vocab.lua.
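To give an idea of what "building a vocabulary" means conceptually, here is a small Python sketch. It is not a reimplementation of tools/build_vocab.lua and makes no claim about that script's options or output format; the function name, the whitespace tokenization, and the max_size cutoff are all assumptions for illustration. The point is that the token inventory is fixed in one pass over the data up front, so training can then stream files without ever seeing the whole corpus at once.

```python
from collections import Counter


def build_vocab(paths, max_size=50000):
    """Return the most frequent whitespace-separated tokens across files.

    Illustrative only: real toolkits also add special tokens and write
    the result in their own file format.
    """
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    # most_common sorts by descending frequency.
    return [token for token, _ in counts.most_common(max_size)]
```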

Yes, but if you have lots of data (> 10M), you should also consider using the sampling option as described in the documentation I linked above.
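Since the files in the folder have to match in line count, it may be worth checking that before starting a long training run. Below is a hedged Python sketch of such a check. The helper name and the ".src" / ".tgt" suffixes are assumptions for the example, so substitute whatever naming your setup actually expects.

```python
from pathlib import Path


def count_lines(path):
    """Count lines by streaming the file, so large files are fine."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_parallel_dir(directory, src_suffix=".src", tgt_suffix=".tgt"):
    """Return a list of (source file name, problem) for unusable pairs.

    The suffixes are placeholders, not confirmed toolkit defaults.
    """
    problems = []
    for src in sorted(Path(directory).glob("*" + src_suffix)):
        # Swap the source suffix for the target suffix to find the pair.
        tgt = src.with_name(src.name[: -len(src_suffix)] + tgt_suffix)
        if not tgt.exists():
            problems.append((src.name, "missing target file"))
            continue
        n_src, n_tgt = count_lines(src), count_lines(tgt)
        if n_src != n_tgt:
            problems.append(
                (src.name, f"{n_src} source lines vs {n_tgt} target lines")
            )
    return problems
```

An empty list back from check_parallel_dir("my_train_dir") would mean every source file found has a target file of the same length.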

Thanks for the time you took to reply, you da man.