Alternate method of requesting vocab_size during preprocessing

dbl · February 28, 2017, 9:54pm

I usually pre-calculate my vocab sizes by counting token frequencies and selecting a cutoff, then use the resulting count as vocab_size (instead of an arbitrary size like 50000). For example, I count how many tokens appear in the corpus at least 5 times. I’ve written a Python script that does this (min_count is specified on the command line) and generates a config file with the appropriate parameters.

I suspect others may do something similar, so I thought it might be cool to have something like {src,tgt}_vocab_token_min_count as an alternative to {src,tgt}_vocab_size.
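
For anyone who wants to try the same thing, the counting step can be sketched roughly as below. This is only a minimal illustration, not the actual script: the function name `vocab_size_for_min_count` is hypothetical, tokens are assumed to be whitespace-separated, and any special tokens a toolkit reserves (padding, unknown, etc.) are not counted.

```python
from collections import Counter


def vocab_size_for_min_count(lines, min_count):
    """Return how many distinct tokens occur at least `min_count` times.

    `lines` is any iterable of strings; tokens are assumed to be
    separated by whitespace (an assumption for this sketch).
    """
    counts = Counter()
    for line in lines:
        # str.split() with no arguments splits on any run of whitespace
        counts.update(line.split())
    return sum(1 for freq in counts.values() if freq >= min_count)


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran", "the dog sat"]
    # "the" occurs 3 times, "cat" and "sat" twice, "ran" and "dog" once
    print(vocab_size_for_min_count(corpus, 2))  # prints 3
```

The returned number is what I would then pass as vocab_size; for a real corpus you would feed in the lines of the training file instead of the toy list.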

guillaumekln · March 2, 2017, 2:02pm

This one was quick to add:

dbl · March 2, 2017, 2:27pm

That was quick! Thanks!