Getting the vocabulary after preprocessing

How do I get the vocabulary after preprocessing with preprocess.py? I get files for the train and validation datasets, plus a vocab file. This vocab file is a dict of fields holding TextMultiField objects, and I cannot seem to find any vocabulary mapping in these files.

Hi Rajashan,

This is covered here: https://github.com/OpenNMT/OpenNMT-py/issues/332

But I don’t get torchtext.vocab.Vocab objects; I get onmt.inputters.text_dataset.TextMultiField objects, which don’t seem to be vocabs.

Maybe there are more experienced forum members who can help you with this. This approach always worked fine for me when I opened the ‘.vocab.pt’ file this way.

Hi there,
Yes, the structure was changed a while ago. Now you have to go a bit deeper to get the vocab.
After loading your vocab with torch.load, you need to access vocab['src'].fields[0][1].vocab to get the torchtext Vocab object. (Same for ‘tgt’.)
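
To make that access path concrete, here is a minimal sketch. The helper name get_vocab is made up for illustration, and the stand-in objects only mimic the nesting described above (a dict of sides, each with a .fields list of (name, field) pairs) so the snippet runs without OpenNMT-py installed. With a real file you would pass in the result of torch.load on your own ‘.vocab.pt’ file instead. I am also assuming the legacy torchtext Vocab, which exposes itos (a list) and stoi (a dict).

```python
from types import SimpleNamespace


def get_vocab(fields, side="src"):
    """Return the vocab object for one side ('src' or 'tgt').

    Follows the path from the post: fields[side].fields[0][1].vocab
    """
    multi_field = fields[side]               # TextMultiField in OpenNMT-py
    _name, base_field = multi_field.fields[0]  # first (name, field) pair
    return base_field.vocab


def _fake_side(name, tokens):
    # Stand-in for a TextMultiField; NOT the real OpenNMT-py class.
    vocab = SimpleNamespace(
        itos=list(tokens),
        stoi={tok: idx for idx, tok in enumerate(tokens)},
    )
    return SimpleNamespace(fields=[(name, SimpleNamespace(vocab=vocab))])


# With a real file (assumed usage, path is hypothetical):
#   import torch
#   fields = torch.load("your_data.vocab.pt")
fields = {
    "src": _fake_side("src", ["<unk>", "<blank>", "hello"]),
    "tgt": _fake_side("tgt", ["<unk>", "<blank>", "bonjour"]),
}

src_vocab = get_vocab(fields, "src")
tgt_vocab = get_vocab(fields, "tgt")

print(src_vocab.stoi["hello"])  # token -> index
print(tgt_vocab.itos[2])        # index -> token
```

On a real vocab, stoi gives you the token-to-index mapping and itos the reverse, so len(src_vocab.itos) should be your source vocabulary size.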