Dealing with HUGE vocabularies

I am currently working with a set of morphologically rich languages, each having a vocabulary size of roughly 250,000 to 300,000. I keep getting a warning similar to

“UserWarning: Converting sparse IndexedSlices to a dense Tensor with 146618880 elements. This may consume a large amount of memory.”

When I truncate my vocabulary to 50-60% of its original size, keeping only the most frequent words, I end up with too many unk tokens in the resulting output.

How do I deal with this?

Are there any options to make the onmt training more memory efficient?

Yes, it is very common to use byte pair encoding (BPE), which splits your words into their most frequent subword units: https://github.com/OpenNMT/Tokenizer
Enjoy your huge vocabulary :wink:
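For example, here is a minimal sketch using the Tokenizer's Python bindings (pyonmttok), assuming the BPE learning API shown in that repo's README; the file names, vocabulary size, and tokenization mode below are just placeholders to adapt to your setup:

```python
import pyonmttok

# Base tokenizer used for pre-tokenization before learning BPE merges.
# "aggressive" and joiner_annotate=True are example settings, not requirements.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Learn a BPE model with e.g. 32k merge operations from your training corpus.
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("train.txt")          # hypothetical corpus file
bpe_tokenizer = learner.learn("bpe.model")  # writes the model, returns a tokenizer

# Apply the learned subword segmentation; rare words become frequent subwords.
tokens, _ = bpe_tokenizer.tokenize("a morphologically rich example sentence")
print(tokens)
```

Apply the same learned model to your training, validation, and test data so the segmentation is consistent, and detokenize the model output afterwards. This typically brings the vocabulary down to a few tens of thousands of subword types with almost no unk tokens.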


Thanks! Will check it out.
