onmt-build-vocab --size 50000 in OpenNMT-tf produces a vocabulary of size 50,000 including three additional tokens, <blank>, <s>, and </s> (50,000 lines in the output), while onmt_preprocess --src_vocab_size 50000 in OpenNMT-py seems to produce a vocabulary of size 50,000 excluding additional tokens (<unk> and <blank> are added, as you can see in the .vocab.pt file).
If we want the same vocabulary in both versions, should we use onmt-build-vocab --size 50003 instead of onmt-build-vocab --size 50000 in OpenNMT-tf?
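
To make the arithmetic concrete, here is a minimal Python sketch of what I have in mind. It assumes my reading above is right, i.e. that --size in OpenNMT-tf counts the three special tokens; I have not verified this against the code. The name tf_size_for is hypothetical and not part of either toolkit.

```python
# Assumption from the question: OpenNMT-tf counts these tokens inside --size.
TF_SPECIAL_TOKENS = ["<blank>", "<s>", "</s>"]


def tf_size_for(regular_tokens: int, specials_counted_in_size: bool = True) -> int:
    """Return the --size value that leaves room for `regular_tokens` real words.

    If the special tokens are counted inside --size (my assumption), they have
    to be added on top of the number of regular tokens we want to keep.
    """
    if specials_counted_in_size:
        return regular_tokens + len(TF_SPECIAL_TOKENS)
    return regular_tokens


if __name__ == "__main__":
    # 50,000 regular tokens, as with --src_vocab_size 50000 in OpenNMT-py.
    print(tf_size_for(50000))
```

If the assumption is wrong and the special tokens are added on top of --size, the second argument can be set to False and the value stays at 50000.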