If we are going to build a vocabulary from scratch for OpenNMT-py, we can use a command like this:
# -config: path to your config.yaml file
# -n_sample: use -1 to build the vocabulary on all the segments in the training dataset
# -num_threads: change it to match the number of CPUs to run it faster
onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2
However, many of us use SentencePiece for sub-wording, which generates both a sub-wording model and a vocabulary list. We cannot use the vocab file generated by SentencePiece directly in OpenNMT-py, so we first have to convert it to a version compatible with OpenNMT-py. Note that the new vocab file will be 3 lines shorter, as the script removes the default tokens in OpenNMT-py, e.g. <unk>, <s>, and </s>. The conversion can be done as follows:
git clone https://github.com/OpenNMT/OpenNMT-py.git
cd OpenNMT-py
python3 setup.py install
cd ..
cat spm.vocab | python3 OpenNMT-py/tools/spm_to_vocab.py > spm.onmt_vocab
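To make the conversion less of a black box, here is a minimal Python sketch of the kind of transformation involved. It is an illustration, not the actual spm_to_vocab.py script: the function name `spm_to_onmt_lines`, the `scale` factor, and the sample scores are assumptions for the example. It assumes each SentencePiece vocab line has the form "piece<TAB>score", drops the three default tokens, and turns each score into a positive integer pseudo-count.

```python
import math

# Assumption: these are the default tokens dropped during conversion,
# since OpenNMT-py adds its own special tokens.
DEFAULT_TOKENS = {"<unk>", "<s>", "</s>"}


def spm_to_onmt_lines(spm_lines, scale=1_000_000):
    """Turn "piece<TAB>score" lines into "piece<TAB>count" lines (illustrative)."""
    out = []
    for line in spm_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        piece, score = line.split("\t", 1)
        if piece in DEFAULT_TOKENS:
            continue  # these lines account for the shorter output file
        # Map the log-probability-like score to a positive integer pseudo-count.
        count = int(math.exp(float(score)) * scale) + 1
        out.append(f"{piece}\t{count}")
    return out


# Hypothetical sample in the SentencePiece vocab layout.
sample = [
    "<unk>\t0",
    "<s>\t0",
    "</s>\t0",
    "▁the\t-3.2",
    "s\t-3.9",
]
converted = spm_to_onmt_lines(sample)
print(len(sample) - len(converted))  # number of lines removed: 3
for row in converted:
    print(row)
```

A quick sanity check on a real run is to compare the line counts of spm.vocab and spm.onmt_vocab. A difference of 3 is expected.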