Alternate method of requesting vocab_size during preprocessing

I usually pre-calculate my vocab sizes by counting token frequencies and picking a cutoff, then use the resulting count for vocab_size (instead of an arbitrary size like 50000). E.g., I count how many distinct tokens appear in the corpus at least 5 times. I’ve written a Python script that does this (min_count is given on the command line) and writes a config file with the appropriate parameters.
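
For illustration, the counting step could look roughly like the sketch below. This is not the actual script: the function name `vocab_size_for_min_count` is hypothetical, and it assumes the corpus is already tokenized with whitespace between tokens.

```python
from collections import Counter
from typing import Iterable


def vocab_size_for_min_count(lines: Iterable[str], min_count: int) -> int:
    """Return how many distinct tokens occur at least min_count times.

    Hypothetical helper: assumes each line is already tokenized and
    tokens are separated by whitespace.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Keep only the tokens whose frequency reaches the cutoff.
    return sum(1 for freq in counts.values() if freq >= min_count)


# Toy corpus: "the" occurs 3 times, "cat" and "sat" twice, the rest once.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
size = vocab_size_for_min_count(corpus, min_count=2)
print(size)
```

The returned number is what would then be passed as vocab_size. For a real corpus, the list can be replaced by an open file handle, since the function only needs an iterable of lines.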

I suspect others may do something similar, so I thought it might be cool to have something like {src,tgt}_vocab_token_min_count as an alternative to {src,tgt}_vocab_size.

This one was quick to add:

That was quick! Thanks! :+1: