OpenNMT Forum

OpenNMT preprocess question

Hi, I am using the onmt_preprocess command.
In my case, I pass data encoded with the BPE method of SentencePiece to --train_src and --valid_src.
I used the bpe.vocab file from SentencePiece as the --src_vocab option.
How should the --src_vocab and --tgt_vocab options of onmt_preprocess be used?
If the data is pre-encoded with a method like BPE and passed to --train_src or --valid_src, what is the proper input for the --src_vocab option?

You don't necessarily need to provide a vocab. It will be computed during preprocessing.
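
To make "computed during preprocessing" a bit more concrete, here is a minimal sketch of the general idea: count the whitespace-separated tokens in the already BPE-encoded training lines and keep the most frequent ones. This is not OpenNMT-py's actual code. The function name `build_vocab` and its parameters `max_size` and `min_frequency` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def build_vocab(lines, max_size=None, min_frequency=1):
    """Count whitespace-separated tokens and return them, most frequent first.

    Hypothetical helper for illustration; not part of OpenNMT-py.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    tokens = [tok for tok, freq in ranked if freq >= min_frequency]
    return tokens[:max_size] if max_size is not None else tokens


# Made-up BPE-encoded lines, in the style SentencePiece emits.
sample = ["▁the ▁cat ▁s at", "▁the ▁dog ▁s at"]
vocab = build_vocab(sample)
```

Because the input is already split into subword pieces, the resulting vocabulary is made of those pieces rather than full words.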

Thank you for your reply.
Yes, I know that I don't have to set a vocab option.
So is there no difference between using the vocab option and not using it?
And if I don't use the vocab options, can't I get a vocabulary?

The reason I ask is that I saw someone use the onmt-build-vocab command in OpenNMT-tf.

OpenNMT-tf requires a vocab to be specified as it will read the data on the fly, without preprocessing.
OpenNMT-py requires preprocessing (for now), during which the vocab is built, if not provided.

If you want, for instance, to restrict the model to a specific vocab, you can specify it at preprocessing.
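
If you go that route with a SentencePiece vocabulary, a rough sketch of one way to prepare the file follows. SentencePiece writes its .vocab file as one piece per line followed by a tab and a score. The sketch strips the scores and produces a plain token list. Two assumptions here are not confirmed anywhere in this thread: that the file given to --src_vocab should hold one token per line, and that the SentencePiece control symbols (`<unk>`, `<s>`, `</s>`) should be left out because the toolkit adds its own special tokens. The function names and the output file name are hypothetical.

```python
def sentencepiece_vocab_to_tokens(vocab_lines, skip=("<unk>", "<s>", "</s>")):
    """Extract the piece from each 'piece<TAB>score' line of a SentencePiece .vocab file.

    Hypothetical helper; skipping the control symbols is an assumption.
    """
    tokens = []
    for line in vocab_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        piece = line.split("\t")[0]  # drop the score column
        if piece not in skip:
            tokens.append(piece)
    return tokens


def convert_file(src_path, dst_path):
    """Read a SentencePiece .vocab file and write one token per line."""
    with open(src_path, encoding="utf-8") as src:
        tokens = sentencepiece_vocab_to_tokens(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(tokens) + "\n")


# Example call with a made-up output name:
# convert_file("bpe.vocab", "src_vocab.txt")
```

The converted file could then be tried as the --src_vocab value. Check the vocabulary size reported by preprocessing to confirm it was read as expected.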
