Multi-source vocabularies

idfc_maldini · July 9, 2021, 1:29pm

Hi,

I was looking at the vocabulary documentation page and was wondering how multi-source vocabularies work. I tried googling some papers about it but couldn’t find any. Does the total vocabulary size end up being the sum of the vocabulary size of the different sources?

I’m planning on training a model from 2 different corpora (from the same domain), but one corpus is much larger than the other. Would it be suitable to build a vocabulary on each corpus, so words from the smaller corpus are covered? Otherwise, I may just oversample the small corpus, merge the two corpora together and build one vocabulary on the merged corpus.

Thanks,

Albert

guillaumekln · July 9, 2021, 1:51pm

Hi,

There are many papers about multi-source neural machine translation. You can read this one for example:

In your case, it seems you actually have a single source (the source language) but your training data consists of 2 corpora. You can just merge the 2 files and build a single vocabulary as you mentioned.
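
A minimal sketch of that merge-then-build approach, assuming whitespace tokenization and simple integer oversampling of the smaller corpus. The helper names `oversample` and `build_vocab`, the toy sentences, and the `factor` and `min_frequency` values are all hypothetical and only for illustration; they are not part of any library.

```python
from collections import Counter
from itertools import chain


def oversample(lines, factor):
    """Repeat every line `factor` times (simple integer oversampling)."""
    return [line for line in lines for _ in range(factor)]


def build_vocab(lines, max_size=None, min_frequency=1):
    """Count whitespace-separated tokens and return them by descending frequency."""
    counts = Counter(chain.from_iterable(line.split() for line in lines))
    tokens = [tok for tok, freq in counts.most_common() if freq >= min_frequency]
    return tokens[:max_size] if max_size is not None else tokens


# Toy stand-ins for the two corpora; real data would be read from files.
large_corpus = ["the cat sat", "the dog sat", "the cat ran"]
small_corpus = ["a rare term"]

# Oversample the small corpus, merge, then build one vocabulary.
merged = large_corpus + oversample(small_corpus, factor=3)
vocab = build_vocab(merged, min_frequency=2)
```

With a frequency cutoff like the one above, tokens that appear only in the small corpus can fall below the threshold unless that corpus is oversampled, which is the main reason to oversample before building the single vocabulary.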

idfc_maldini · July 9, 2021, 3:51pm

cool, thanks for your help!