I have 8-way parallel data that I want to train my seq2seq model on. However, when I tried to preprocess the data, it basically said I need to do the preprocessing manually and can't use -train_dir until that's done. Can I get some help?
It specifically gives me the error:
“For directory mode, vocabs should be predefined.”
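For reference, the only preprocessing I know is the standard single-pair command from the quickstart, roughly like this (the paths and the -save_data prefix are just placeholders; my understanding is that this run is also what produces the .src.dict and .tgt.dict vocab files):

    th preprocess.lua \
        -train_src data/train.src -train_tgt data/train.tgt \
        -valid_src data/valid.src -valid_tgt data/valid.tgt \
        -save_data data/demo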
Yeah, I read through it before I posted here. Sorry, I'm a newbie. So do I just need to preprocess each of my data files individually, and then I can set up my -train_dir? I was just hoping there was an easy, ready-made way to preprocess a whole folder of data instead of one src and one tgt.
Yeah, the dynamic dataset section is what I was looking at, and that's where I saw the -train_dir option. But like I said, after trying that option it gave me “For directory mode, vocabs should be predefined.”
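Is the fix just to build the vocabs once and then point the directory run at them? This is roughly what I'm picturing (the -src_vocab / -tgt_vocab option names and the .dict paths are my guess based on the single-pair output, so correct me if that's not how directory mode expects them):

    # 1) build vocabs once from one representative pair
    th preprocess.lua -train_src data/part1.src -train_tgt data/part1.tgt \
        -valid_src data/valid.src -valid_tgt data/valid.tgt -save_data data/demo

    # 2) then retry the -train_dir run, passing the predefined vocabs
    #    (I'm assuming the options are -src_vocab / -tgt_vocab)
    th preprocess.lua -train_dir data/ \
        -src_vocab data/demo.src.dict -tgt_vocab data/demo.tgt.dict \
        -save_data data/all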
To clarify: if you use -train_dir, point it at the correct folder, and (best case) everything is working, does that mean it will by default use all the data files inside that folder for training, as long as the src and tgt files have matching line counts, whether there's 1 file pair or a bunch of them?
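Just so we're talking about the same thing, this is the kind of layout I have in mind, where every pair in the folder would get picked up (the file names are only an example, and I'm assuming matching .src/.tgt suffixes is how pairs get matched up):

    data/
        part1.src  part1.tgt
        part2.src  part2.tgt
        ...
        part8.src  part8.tgt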