Getting the vocabulary after preprocessing

Rajashan · June 28, 2019, 1:55pm

How does one get the vocabulary of a tensor after preprocessing with preprocess.py? I get files for train and validation datasets and a vocab file. This vocab file is a dict of fields and TextMultiField. I cannot seem to find any vocabulary mapping in these files.

tlysenko · June 28, 2019, 4:25pm

Hi Rajashan,

This is covered here: https://github.com/OpenNMT/OpenNMT-py/issues/332

Rajashan · June 28, 2019, 7:08pm

But I don’t get torchtext.vocab.Vocab objects, I get onmt.inputters.text_dataset.TextMultiField objects, which don’t seem to be vocabs?

tlysenko · June 29, 2019, 9:38am

Maybe there are more experienced forum members who can help you with this. This approach always worked fine for me when I opened the ‘.vocab.pt’ file this way.

francoishernandez · September 24, 2019, 4:52pm

Hi there,
Yes, the structure was changed a while ago. Now you have to go a bit deeper to get the vocab.
After loading your vocab with torch.load, you need to get vocab['src'].fields[0][1].vocab to get the torchtext Vocab object. (Same for ‘tgt’.)
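
As a rough illustration of the access path described above, here is a minimal sketch. The helper names extract_vocab and load_vocab are hypothetical (they are not part of OpenNMT-py), and the stand-in objects at the bottom only mimic the nesting described in this thread so the lookup can be tried without a real '.vocab.pt' file. It also assumes the loaded file is a dict keyed by 'src' and 'tgt', as reported in the question.

```python
from types import SimpleNamespace


def extract_vocab(fields, side="src"):
    """Follow fields[side].fields[0][1].vocab, the path given in the reply above."""
    multifield = fields[side]                 # TextMultiField in a real vocab file
    _name, base_field = multifield.fields[0]  # first (name, field) pair
    return base_field.vocab                   # torchtext Vocab in a real vocab file


def load_vocab(path, side="src"):
    """Hypothetical convenience wrapper: load the vocab file, then extract one side."""
    import torch  # imported here so extract_vocab can be tried without torch installed

    fields = torch.load(path)
    return extract_vocab(fields, side)


# Stand-in objects that only imitate the nesting; they are NOT OpenNMT classes.
fake_vocab = SimpleNamespace(
    itos=["<unk>", "<blank>", "hello"],
    stoi={"<unk>": 0, "<blank>": 1, "hello": 2},
)
fake_fields = {
    "src": SimpleNamespace(fields=[("src", SimpleNamespace(vocab=fake_vocab))]),
}

src_vocab = extract_vocab(fake_fields, "src")
print(src_vocab.itos[2], src_vocab.stoi["hello"])
```

With a real file, something like load_vocab("data/demo.vocab.pt", "tgt") would be the equivalent call (the path is a placeholder). On the torchtext versions in use at the time of this thread, the returned Vocab exposes itos (index to token) and stoi (token to index), which is the mapping the original question asks about.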