How does one get the vocabulary of a tensor after preprocessing with preprocess.py? I get files for the train and validation datasets and a vocab file. This vocab file is a dict of fields and TextMultiField objects. I cannot seem to find any vocabulary mapping in these files.
But I don’t get torchtext.vocab.Vocab objects; I get onmt.inputters.text_dataset.TextMultiField objects, which don’t seem to be vocabs?
Maybe there are more experienced forum members who can help you with this. This approach always worked fine for me when I opened the ‘.vocab.pt’ file this way.
Hi there,
Yes, the structure was changed a while ago. Now you have to go a bit deeper to get the vocab.
After loading your vocab with torch.load, you need to get vocab['src'].fields[0][1].vocab to get the torchtext Vocab object. (Same for ‘tgt’.)
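In case it helps, here is a minimal sketch of that lookup in Python. The helper names get_side_vocab and load_side_vocab are made up for illustration and are not part of OpenNMT. The stand-in objects at the bottom only mimic the nesting described above so the snippet runs without a real '.vocab.pt' file. The itos and stoi attributes are what I would expect on a legacy torchtext Vocab, so treat those as an assumption about your installed version.

```python
from types import SimpleNamespace


def get_side_vocab(fields, side="src"):
    """Follow the path described above: fields[side].fields[0][1].vocab."""
    multi_field = fields[side]            # expected to be a TextMultiField
    return multi_field.fields[0][1].vocab


def load_side_vocab(path, side="src"):
    """Hypothetical helper: load the vocab file, then drill down to one side."""
    import torch  # imported here so the demo below runs even without torch installed

    fields = torch.load(path)             # the dict saved by preprocess.py
    return get_side_vocab(fields, side)


# Stand-ins that only imitate the nesting; these are NOT real OpenNMT classes.
fake_vocab = SimpleNamespace(
    itos=["<unk>", "<blank>", "hello"],            # index -> token
    stoi={"<unk>": 0, "<blank>": 1, "hello": 2},   # token -> index
)
fake_fields = {
    "src": SimpleNamespace(fields=[("src", SimpleNamespace(vocab=fake_vocab))]),
}

vocab = get_side_vocab(fake_fields, "src")
print(len(vocab.itos), vocab.stoi["hello"])
```

With a real file you would call something like load_side_vocab("your_data.vocab.pt", "tgt"), where the path is a placeholder for whatever preprocess.py wrote for you.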