Steps to convert SentencePiece vocab to OpenNMT-py vocab

If we are going to build a vocabulary from scratch for OpenNMT-py, we should use something like this:

# -config: path to your config.yaml file
# -n_sample: use -1 to build the vocabulary on all the segments in the training dataset
# -num_threads: set it to the number of available CPUs to run faster

onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 2
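
For reference, the config.yaml above is the regular OpenNMT-py data configuration. As a minimal sketch (all file paths here are placeholders), it could look like this:

# config.yaml (illustrative; replace the paths with your own)
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
overwrite: False

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt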

However, many of us use SentencePiece for sub-wording, which generates both a sub-wording model and a vocabulary list. This vocab file cannot be used directly in OpenNMT-py; it first has to be converted to a compatible format. Note that the converted vocab file will be 3 lines shorter, as the script removes the tokens that OpenNMT-py adds by default, i.e. <unk>, <s>, and </s>.

pip3 install --upgrade OpenNMT-py
wget https://raw.githubusercontent.com/OpenNMT/OpenNMT-py/master/tools/spm_to_vocab.py
cat spm.vocab | python3 spm_to_vocab.py > spm.onmt_vocab
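
Under the hood, the conversion is simple. Here is a rough, illustrative sketch of what the script does (the special-token names match OpenNMT-py's defaults, but the exact scaling in the real script may differ, so treat this as a sketch rather than the actual tool):

import math
import sys

# OpenNMT-py adds these tokens itself, hence the 3-line difference
SPECIALS = {"<unk>", "<s>", "</s>"}

for line in sys.stdin:
    token, logprob = line.rstrip("\n").split("\t")
    if token in SPECIALS:
        continue
    # SentencePiece stores log-probabilities, while OpenNMT-py expects
    # frequencies, so each log-probability becomes an integer pseudo-count
    count = int(math.exp(float(logprob)) * 1e6) + 1
    print(f"{token}\t{count}")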

Kind regards,
Yasmin


@ymoslem We cloned and installed OpenNMT-py on Colab:
git clone https://github.com/OpenNMT/OpenNMT-py
python3 /content/OpenNMT-py/setup.py install

Then we executed these commands:
pip install onmt
cat /content/as_spm_10000.vocab | python3 /content/OpenNMT-py/tools/spm_to_vocab.py > spm.onmt_vocab

We encountered the following error:
Traceback (most recent call last):
  File "/content/OpenNMT-py/tools/spm_to_vocab.py", line 6, in <module>
    from onmt.constants import DefaultTokens
ModuleNotFoundError: No module named 'onmt.constants'

The onmt package you installed is not OpenNMT-py (the correct package name on PyPI is OpenNMT-py). Please try the following:

pip3 install --upgrade OpenNMT-py
wget https://raw.githubusercontent.com/OpenNMT/OpenNMT-py/master/tools/spm_to_vocab.py
cat spm.vocab | python3 spm_to_vocab.py > spm.onmt_vocab
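
If the right package is installed, the import that failed above should now succeed. A quick sanity check (assuming a recent OpenNMT-py):

python3 -c "from onmt.constants import DefaultTokens; print('OK')"

You can also inspect the converted file with head -n 5 spm.onmt_vocab to make sure each line is a token followed by a frequency.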

For more about how to run OpenNMT-py, feel free to refer to this tutorial:

Kind regards,
Yasmin

Hey @ymoslem, hey Community,

is building a vocabulary with onmt_build_vocab essentially the same as building it with SentencePiece? In other words, can I achieve the same (or nearly the same) Byte Pair Encoding with onmt_build_vocab as I would get with the SentencePiece trainer?

And if I trained my BPE model and vocab with SentencePiece, do I still need to build an OpenNMT vocabulary with onmt_build_vocab, or is that unnecessary since spm_to_vocab.py already transforms it into the right format? I am asking because there are so many *_subword_model flags in the onmt_build_vocab --help.

I am essentially wondering whether I can just use the OpenNMT pipeline all the way, since it might be simpler and would guarantee compatibility.

For context, I am building a translation model from scratch with a pivot strategy for my master's thesis. For this, I am trying to build a shared vocabulary between the source and pivot languages.

Thank you for your time and help, kind regards,
Jonny

Hi Jonny,

If you use onmt_build_vocab without any sub-wording transform, it does not subword at all; it just counts tokens and calculates their frequencies. To subword on the fly, you need to set up a Transform in your config. You can find more information about the Tokenization transforms here:
https://opennmt.net/OpenNMT-py/FAQ.html#tokenization
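
For example, with a trained SentencePiece model, the relevant part of the training config could look like this (illustrative; the spm.model paths are placeholders for your own model files):

data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [sentencepiece]

src_subword_model: spm.model
tgt_subword_model: spm.model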

With this in mind, I personally find training a SentencePiece model easier. However, if you know how to use the Transform, this can save you an extra step.
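
If you go the SentencePiece route, training a single shared BPE model for both languages is one command. An illustrative example (file names and vocabulary size are placeholders):

spm_train --input=train.src,train.pivot --model_prefix=spm \
    --vocab_size=32000 --model_type=bpe --character_coverage=1.0

This produces spm.model and spm.vocab, which you can then convert with spm_to_vocab.py as above.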

All the best,
Yasmin

Hey @ymoslem,

thanks a lot for the direction 🙂. I'll give the SentencePiece approach a try then!

Thanks and to you too,
Jonny