Need help preprocessing my data corpus

(Jason DeLong) #1

I have 8-way parallel data and I want to train my seq2seq model on it. However, when I tried to preprocess the data, it basically said I need to do it manually and cannot use -train_dir until it’s preprocessed. Can I get some help?

It specifically gives me the error:
“For directory mode, vocabs should be predefined.”

(Guillaume Klein) #2


Did you try to follow the quickstart guide first?

(Jason DeLong) #3

Yeah, I read through it before I posted here. Sorry, I’m a newbie. So do I just need to preprocess each of my data files individually, and then I can set up my -train_dir? I was hoping there was an easy, ready-made way to preprocess a whole folder of data instead of one src file and one tgt file.

(Guillaume Klein) #4

-train_dir is an advanced option that I would not recommend for getting started.

To train your model, you just need to prepare the 4 files listed in the quickstart:

  • src-train.txt
  • tgt-train.txt
  • src-val.txt
  • tgt-val.txt

If you have multiple training files, simply concatenate them.
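As a rough illustration of the concatenation step, here is a minimal Python sketch. It is not part of OpenNMT; the function names and the idea of passing (source, target) path pairs are assumptions made for this example. It checks that each pair has the same number of lines before appending, so the two output files stay aligned line by line.

```python
def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def concatenate_parallel(pairs, src_out, tgt_out):
    """Append every (src, tgt) file pair to src_out / tgt_out.

    `pairs` is a list of (source_path, target_path) tuples (a hypothetical
    layout chosen for this sketch). Returns the total number of sentence
    pairs written.
    """
    total = 0
    with open(src_out, "w", encoding="utf-8") as s_out, \
         open(tgt_out, "w", encoding="utf-8") as t_out:
        for src_path, tgt_path in pairs:
            n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
            if n_src != n_tgt:
                raise ValueError(
                    f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt}"
                )
            for in_path, out in ((src_path, s_out), (tgt_path, t_out)):
                with open(in_path, encoding="utf-8") as f:
                    for line in f:
                        # Make sure a file without a trailing newline does not
                        # glue its last line onto the next file's first line.
                        out.write(line if line.endswith("\n") else line + "\n")
            total += n_src
    return total
```

For example, concatenate_parallel([("a.src", "a.tgt"), ("b.src", "b.tgt")], "src-train.txt", "tgt-train.txt") would produce the two training files named in the quickstart list; the input file names here are placeholders.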

(Jason DeLong) #5

I’d rather be able to experiment around with the advanced settings than never use them at all.

Could you share a correct way to use train_dir, please?

(Guillaume Klein) #6

See this documentation if you are interested:

(Jason DeLong) #7

Yeah, the dynamic dataset block is where I found the -train_dir option. But like I said, when I tried that option it gave me “For directory mode, vocabs should be predefined.”

To clarify: if I use -train_dir and point it at the correct folder, and everything works, does that mean it will (by default) use all the data files inside that folder for training, as long as their line counts match, whether there is 1 file or a bunch of them?

(Guillaume Klein) #8

Yes, you should first build a vocabulary, because the training data will be streamed. See the script tools/build_vocab.lua.

And yes to your second question, but if you have a lot of data (> 10M), you should also consider using the sampling option as described in the documentation I linked above.
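To illustrate why a vocabulary has to exist before streaming starts, here is a small Python sketch of frequency-based vocabulary building. It only shows the general idea and is not how tools/build_vocab.lua is implemented; the function name, the whitespace tokenization, and the max_size parameter are all assumptions for this example.

```python
from collections import Counter


def build_vocab(paths, max_size=50000):
    """Return the most frequent whitespace-separated tokens across `paths`.

    One full pass over the data is needed to count tokens, which is why the
    vocabulary cannot be derived on the fly while files are being streamed.
    """
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    # Sort by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:max_size]]
```

In practice the script shipped with the toolkit should be preferred, since it writes the vocabulary in the format the training command expects.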

(Jason DeLong) #9

Thanks for the time you took to reply, you da man.