I was looking at the vocabulary documentation page and was wondering how multi-source vocabularies work. I tried googling for papers about it but couldn’t find any. Does the total vocabulary size end up being the sum of the vocabulary sizes of the different sources?
I’m planning to train a model on two corpora from the same domain, but one corpus is much larger than the other. Would it be suitable to build a separate vocabulary on each corpus, so that words from the smaller corpus are still covered? Otherwise, I may just oversample the small corpus, merge the two corpora, and build one vocabulary on the merged corpus.
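To make the two options concrete, here is a minimal sketch of what I mean (the `build_vocab` helper and the toy corpora are just illustrations, not any toolkit’s actual API):

```python
from collections import Counter

def build_vocab(corpus, max_size):
    """Count token frequencies and keep the most frequent tokens."""
    counts = Counter()
    for sentence in corpus:
        counts.update(sentence.split())
    return [tok for tok, _ in counts.most_common(max_size)]

# Toy stand-ins for the two corpora (same domain, very different sizes)
large = ["the cat sat", "the dog ran", "the cat ran"]
small = ["a rare axolotl swam"]

# Option 1: build a vocabulary per corpus, then take the union,
# so rare words from the small corpus cannot be crowded out.
union_vocab = set(build_vocab(large, 10)) | set(build_vocab(small, 10))

# Option 2: oversample the small corpus, merge, and build one vocabulary.
# Repeating the small corpus boosts its token counts before truncation.
oversampled = large + small * 3
merged_vocab = build_vocab(oversampled, 10)
```

With a tight `max_size`, option 1 guarantees coverage of the small corpus, while option 2 only raises the odds that its words survive the frequency cutoff.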