I usually pre-calculate my vocab sizes by counting token frequencies and selecting a cutoff, then use the resulting count for vocab_size (instead of an arbitrary size like 50000). For example, I count how many distinct tokens appear in the corpus at least 5 times. I’ve written a Python script that does this (min_count is specified on the command line) and generates a config file with the appropriate parameters.
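
For anyone who wants to try the idea, here is a minimal sketch of the counting step. It is not my actual script: the function name vocab_size_for_min_count is made up for illustration, and it assumes the corpus is already tokenized so that splitting on whitespace yields the tokens.

```python
from collections import Counter


def vocab_size_for_min_count(lines, min_count):
    """Count distinct tokens that occur at least min_count times.

    Assumes each line is already tokenized and whitespace-separated.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Keep only the token types whose frequency reaches the cutoff.
    return sum(1 for freq in counts.values() if freq >= min_count)


# Tiny illustrative corpus (made-up data).
corpus = ["a b a c", "a b d", "b c a"]
size = vocab_size_for_min_count(corpus, min_count=2)
print(size)
```

The returned number is what would then go into vocab_size. Depending on the toolkit, special tokens such as padding or unknown-word symbols may or may not be counted against that limit, so the value might need a small adjustment.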
I suspect others may do something similar, so I thought it might be cool to have something like {src,tgt}_vocab_token_min_count as an alternative to {src,tgt}_vocab_size.