Vocabulary size option in OpenNMT-tf and OpenNMT-py

In OpenNMT-tf, onmt-build-vocab --size 50000 produces a vocabulary file of exactly 50,000 lines, and that count includes the three special tokens <blank>, <s>, and </s>. In OpenNMT-py, onmt_preprocess --src_vocab_size 50000 seems to produce a vocabulary of 50,000 tokens excluding the special tokens (<unk> and <blank> are added on top, as you can see in the .vocab.pt file).

If we want the same vocabulary in both versions, should we run onmt-build-vocab --size 50003 instead of onmt-build-vocab --size 50000 in OpenNMT-tf?

Yes. With --size 50003, the final vocabulary size as seen by the model would be 50003 + 1 = 50004. The extra 1 is the <unk> token, which OpenNMT-tf does not write to the vocabulary file (unlike OpenNMT-py).

This matches the OpenNMT-py vocabulary size, which is 50000 + 4 = 50004.
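
To double-check the arithmetic, here is a minimal Python sketch. The function and constant names are hypothetical and not part of either toolkit. The special-token counts are simply the ones quoted in this thread, so treat them as assumptions that may differ between versions.

```python
# Hypothetical helpers for sanity-checking vocabulary sizes.
# Nothing here calls OpenNMT-tf or OpenNMT-py; the counts below are
# assumptions taken from the discussion in this thread.

TF_SPECIALS_IN_FILE = 3    # <blank>, <s>, </s> are counted inside --size
TF_SPECIALS_IMPLICIT = 1   # <unk> is added by the model, not stored in the file
PY_SPECIALS_ADDED = 4      # added on top of --src_vocab_size, per the answer


def tf_regular_tokens(size_option: int) -> int:
    """Number of ordinary (non-special) tokens in the OpenNMT-tf vocab file."""
    return size_option - TF_SPECIALS_IN_FILE


def tf_model_vocab_size(size_option: int) -> int:
    """Vocabulary size seen by the OpenNMT-tf model (file lines + <unk>)."""
    return size_option + TF_SPECIALS_IMPLICIT


def py_model_vocab_size(vocab_size_option: int) -> int:
    """Vocabulary size seen by the OpenNMT-py model (tokens + specials)."""
    return vocab_size_option + PY_SPECIALS_ADDED


def tf_size_matching_py(vocab_size_option: int) -> int:
    """Value of --size that yields the same number of ordinary tokens."""
    return vocab_size_option + TF_SPECIALS_IN_FILE


if __name__ == "__main__":
    py_option = 50000
    tf_option = tf_size_matching_py(py_option)
    print("OpenNMT-tf --size:", tf_option)
    print("OpenNMT-tf model vocab:", tf_model_vocab_size(tf_option))
    print("OpenNMT-py model vocab:", py_model_vocab_size(py_option))
```

Under these assumptions, --size 50000 in OpenNMT-tf would leave room for only 49,997 ordinary tokens, which is why the two toolkits disagree unless the option is raised by 3.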
