Dealing with HUGE vocabularies

sanadi1209 · June 10, 2020, 6:06pm

I am currently working with a set of morphologically rich languages, each with a vocabulary size of about 250,000 to 300,000. I keep getting a warning similar to:

“UserWarning: Converting sparse IndexedSlices to a dense Tensor with 146618880 elements. This may consume a large amount of memory.”

When I truncate the vocabulary to 50-60% of its size, keeping only the most frequent words, I get too many unk tokens in the output.

How do I deal with this?

Are there any options to make the onmt training more memory efficient?
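
For a sense of scale, the element count in the warning can be turned into a rough memory figure. This is only a back-of-the-envelope sketch: the 4-byte float32 element size and the embedding dimension of 512 are assumptions, since neither is stated in this post. Under those assumptions the count divides evenly into a row count that falls inside the vocabulary range mentioned above.

```python
# Element count copied from the warning message.
elements = 146_618_880

# Assumption: the dense tensor holds float32 values (4 bytes each).
bytes_per_element = 4
dense_mib = elements * bytes_per_element / 2**20

# Assumption: an embedding dimension of 512 (not stated in the post).
assumed_dim = 512
rows, remainder = divmod(elements, assumed_dim)

print(f"dense size: {dense_mib:.1f} MiB")
print(f"rows at dim {assumed_dim}: {rows} (remainder {remainder})")
```

If the assumptions hold, each such dense conversion costs several hundred MiB, so shrinking the number of rows (the vocabulary) reduces it proportionally.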

Bachstelze · June 10, 2020, 6:13pm

Yes, it is very common to use byte pair encoding (BPE), which splits words into their most frequent subword units: https://github.com/OpenNMT/Tokenizer
Enjoy your huge vocabulary
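
To illustrate the idea of byte pair encoding, here is a minimal toy sketch in plain Python. It is not the OpenNMT Tokenizer API: the function names (`learn_bpe`, `merge_pair`, `segment`), the `</w>` end-of-word marker, and the tiny word-frequency table are all made up for illustration. It repeatedly merges the most frequent adjacent symbol pair, so frequent words end up as a few long units while rare or unseen words fall back to smaller pieces instead of becoming unk.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word -> frequency table."""
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus counts, for illustration only.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(word_freqs, num_merges=10)

print(segment("newest", merges))  # a word seen during learning
print(segment("lowest", merges))  # an unseen word, still built from known pieces
```

The number of merges controls the final subword vocabulary size, so it can be set far below the 250,000+ word types in the original vocabulary. For real training data, the linked Tokenizer project provides its own BPE implementation; the sketch above only shows the principle.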

sanadi1209 · June 16, 2020, 7:10am

Thanks! Will check out.