Get word frequency from the vocabulary

yiulau · September 24, 2019, 4:39pm

I used preprocess.py to generate a *.vocab.pt file with the -shared_vocab flag on. The loaded output has these entries:

{'src': <onmt.inputters.text_dataset.TextMultiField at 0x7fa28f4e6410>,
 'tgt': <onmt.inputters.text_dataset.TextMultiField at 0x7fa258920210>,
 'indices': <torchtext.data.field.Field at 0x7fa258920310>}

I am trying to extract the frequency of each word in the vocabulary so that I can match it with the corresponding row in the embedding matrix. How should I do it?

francoishernandez · September 24, 2019, 4:54pm

See here for how to get the torchtext Vocab object.
Once you have it, you can read its freqs attribute, which is what you’re looking for: a Counter mapping each token to its count.
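
For illustration, here is a minimal sketch of lining the counts up with embedding rows. It assumes the Vocab object is a legacy torchtext Vocab with an itos list (index to token) and a freqs Counter, and that row i of the embedding matrix corresponds to itos[i]. The helper name freqs_in_embedding_order, the file name data.vocab.pt, and the demo data are made up for the example; the stand-in object is only there so the snippet runs without torch installed.

```python
from collections import Counter
from types import SimpleNamespace


def freqs_in_embedding_order(vocab):
    """Return [(index, token, count)], where index is the assumed embedding row."""
    # vocab.itos maps index -> token; vocab.freqs is a Counter of token -> count.
    # Special tokens such as <unk> or <pad> are typically absent from freqs,
    # and a Counter returns 0 for missing keys, so they get a count of 0.
    return [(i, tok, vocab.freqs[tok]) for i, tok in enumerate(vocab.itos)]


# Stand-in for a torchtext Vocab (hypothetical data).
# With a real file, the assumed equivalent would be something like:
#   fields = torch.load("data.vocab.pt")      # hypothetical file name
#   vocab = fields["src"].base_field.vocab
demo_vocab = SimpleNamespace(
    itos=["<unk>", "<pad>", "the", "cat"],
    freqs=Counter({"the": 5, "cat": 2}),
)

rows = freqs_in_embedding_order(demo_vocab)
for index, token, count in rows:
    print(index, token, count)
```

Iterating over itos rather than over freqs keeps the result in index order, and it skips any tokens that were counted but later pruned from the vocabulary by a size or minimum-frequency limit.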