Need help preprocessing my data corpus

Jason · January 8, 2018, 11:01am

I have 8-way parallel data and want to train my seq2seq model on it. However, when I tried to preprocess the data, I was told that I need to do it manually and cannot use -train_dir until it’s preprocessed. Can I get some help?

It specifically gives me the error:
“For directory mode, vocabs should be predefined.”

guillaumekln · January 8, 2018, 11:23am

Hello,

Did you try to follow the quickstart guide first?

Jason · January 8, 2018, 3:45pm

Yeah, I read through it before I posted here. Sorry, I’m new to this. So do I just need to preprocess each of my data files individually, and then I can set up -train_dir? I was hoping there was a ready-made way to preprocess a whole folder of data instead of one src and one tgt file.

guillaumekln · January 8, 2018, 3:58pm

-train_dir is an advanced option that I would not recommend for getting started.

To train your model, you just need to prepare the 4 files listed in the quickstart:

src-train.txt
tgt-train.txt
src-val.txt
tgt-val.txt

If you have multiple training files, simply concatenate them.
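
For parallel data, the source and target files have to be concatenated in the same order so that line N of src-train.txt still matches line N of tgt-train.txt. Here is a minimal Python sketch of that idea. The function name and the input file names in the usage note are made up for illustration, and it reads each file fully into memory, so it is only suitable for files that fit in RAM.

```python
def read_lines(path):
    """Return the lines of a UTF-8 text file without their trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def concatenate_parallel(pairs, src_out, tgt_out):
    """Concatenate (source, target) file pairs into two aligned output files.

    `pairs` is a list of (source_path, target_path) tuples. Each pair is
    checked for equal line counts before it is written, so a misaligned
    pair raises an error instead of silently shifting every later line.
    Returns the total number of sentence pairs written.
    """
    total = 0
    with open(src_out, "w", encoding="utf-8") as src_f, \
         open(tgt_out, "w", encoding="utf-8") as tgt_f:
        for src_path, tgt_path in pairs:
            src_lines = read_lines(src_path)
            tgt_lines = read_lines(tgt_path)
            if len(src_lines) != len(tgt_lines):
                raise ValueError(
                    f"{src_path} has {len(src_lines)} lines but "
                    f"{tgt_path} has {len(tgt_lines)}"
                )
            src_f.writelines(line + "\n" for line in src_lines)
            tgt_f.writelines(line + "\n" for line in tgt_lines)
            total += len(src_lines)
    return total
```

For example, `concatenate_parallel([("part1.src", "part1.tgt"), ("part2.src", "part2.tgt")], "src-train.txt", "tgt-train.txt")` would produce the two training files from the quickstart, assuming those hypothetical part files exist.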

Jason · January 8, 2018, 5:11pm

I’d rather be able to experiment around with the advanced settings than never use them at all.

Could you share the correct way to use -train_dir, please?

guillaumekln · January 8, 2018, 5:12pm

See this documentation if you are interested:

http://opennmt.net/OpenNMT/training/sampling/#file-sampling

Jason · January 8, 2018, 5:58pm

Yeah, the dynamic dataset section is where I found the -train_dir option. But as I said, when I tried that option it gave me “For directory mode, vocabs should be predefined.”

To clarify: if I point -train_dir at the correct folder and everything works, will it by default use all the data files in that folder for training (as long as their line counts match), whether there is one file or many?

guillaumekln · January 9, 2018, 8:57am

Yes, you should first build a vocabulary, because the training data will be streamed. See the script tools/build_vocab.lua.
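
As a rough illustration of why the vocabulary has to exist up front: picking the most frequent tokens requires a full pass over all the data, which cannot happen while files are being streamed during training. The Python sketch below shows that counting pass only. It is not what tools/build_vocab.lua does internally, its output is not in OpenNMT's vocabulary file format, and the ".src" suffix is an assumed naming convention for the files in the folder.

```python
from collections import Counter
from pathlib import Path


def count_tokens(directory, suffix):
    """Count whitespace-separated tokens over every file ending in `suffix`."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*" + suffix)):
        with open(path, encoding="utf-8") as f:
            # Read line by line so large files never sit fully in memory.
            for line in f:
                counts.update(line.split())
    return counts


def top_tokens(counts, max_size):
    """Return up to `max_size` tokens, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:max_size]]
```

Something like `top_tokens(count_tokens("data/", ".src"), 50000)` would give the 50,000 most frequent source tokens; the folder name and the size are placeholders.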

And yes, it will use all the files in the folder. But if you have a lot of data (> 10M), you should also consider using the sampling option described in the documentation I linked above.
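
Regarding the matching line counts mentioned in the question, it can save time to check the folder before starting a training run. Below is a minimal Python sketch of such a check. The function names are hypothetical, and the ".src" / ".tgt" suffixes are an assumption about how the files are named, so adjust them to your own layout.

```python
from pathlib import Path


def count_lines(path):
    """Count lines in a file without loading it all into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def find_mismatched_pairs(directory, src_suffix=".src", tgt_suffix=".tgt"):
    """Return (source_file_name, problem) tuples for unusable file pairs.

    A pair is reported when the target file is missing or when the two
    files have different line counts. An empty list means every source
    file in the directory has an aligned target file.
    """
    problems = []
    for src in sorted(Path(directory).glob("*" + src_suffix)):
        # Swap the source suffix for the target suffix to find the partner file.
        tgt = src.with_name(src.name[: -len(src_suffix)] + tgt_suffix)
        if not tgt.exists():
            problems.append((src.name, "missing target file"))
            continue
        n_src, n_tgt = count_lines(src), count_lines(tgt)
        if n_src != n_tgt:
            problems.append(
                (src.name, f"{n_src} source lines vs {n_tgt} target lines")
            )
    return problems
```

Calling `find_mismatched_pairs("data/")` on your training folder (the path is a placeholder) and getting an empty list back means the pairs at least line up by count. It says nothing about whether the sentences themselves are correctly aligned.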

Jason · January 9, 2018, 10:19am

Thanks for the time you took to reply, you da man.