Vocabs in preprocess.py

pdakwale · October 16, 2017, 11:23am

Hi,
I am using the PyTorch version and have an issue with preprocessing.
My requirement is to build a vocabulary from a large bitext and then use it to preprocess and train on subsets of that bitext separately. So I am trying to run preprocess.py once to build the vocabulary, and then run preprocess.py again on each subset bitext, passing the pre-built vocabulary through -src_vocab and -tgt_vocab. The current preprocess.py in the PyTorch version writes a single combined vocabulary file (combined.data.vocab.pt) covering both the source and target sides. However, the input to preprocess.py requires two separate vocab files, one for the source and one for the target. Since the earlier run produces only a single file, I cannot provide two separate files.
Is there a flag in the PyTorch version that creates separate source and target vocab files, and that I am missing? The Lua version creates two different files and doesn't have this problem.

Thanks
Praveen

guillaumekln · October 16, 2017, 12:45pm

Hello,

I recommend opening an issue on the OpenNMT-py GitHub repository. More people there will be able to assist you.
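
In the meantime, one possible workaround is to export each side of the combined vocabulary to its own file. The following is only a sketch under assumptions, not a confirmed OpenNMT-py feature: it assumes you have already loaded combined.data.vocab.pt and reduced it to an ordered token list per side, and that -src_vocab and -tgt_vocab accept plain-text files with one token per line. The helper name write_vocab_files and the output file names are hypothetical.

```python
import os
import tempfile


def write_vocab_files(vocabs, out_dir):
    """Write one plain-text vocab file per side, one token per line.

    `vocabs` maps a side name (e.g. "src", "tgt") to an ordered list of tokens.
    Returns a dict mapping each side name to the path that was written.
    Hypothetical helper: not part of OpenNMT-py.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for side, tokens in vocabs.items():
        # Assumed naming scheme; pick whatever names suit your setup.
        path = os.path.join(out_dir, side + ".vocab.txt")
        with open(path, "w", encoding="utf-8") as f:
            for token in tokens:
                f.write(token + "\n")
        paths[side] = path
    return paths


# Toy stand-in for the token lists extracted from the combined vocab file.
combined = {
    "src": ["<unk>", "<blank>", "hello", "world"],
    "tgt": ["<unk>", "<blank>", "hallo", "welt"],
}
out_dir = tempfile.mkdtemp()
paths = write_vocab_files(combined, out_dir)
```

If the assumed file format holds, paths["src"] and paths["tgt"] could then be passed to -src_vocab and -tgt_vocab when preprocessing each subset. How to extract the token lists from the .pt file depends on the OpenNMT-py version, so check that before relying on this.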