Word embeddings sizes

joni115 · February 27, 2019, 1:45pm

Hello, I have some doubts about the generated word embedding for NMTSmall architecture. As far as I know about word embeddings and neural architectures, word embedding matrix is supposed to be in vocab_source_sizeXembedding_size space. It happens quite the same with target words embeddings. However, when I was playing around with the generated embeddings, I found out that the dimensions does not corresponds to the source/target vocabulary. You can see the picture below…

I thought about the special tokens (, and so on), but vocabulary files contains special tokens too, so I can’t find out why the embedding matrix is in (vocab_size+2)Xsize_embedding space. If you can explain to me the reason, I will appreciate it.

Thank you in advance.

guillaumekln · February 27, 2019, 1:49pm

Hi,

There is an additional index for tokens that are not in your vocabulary file. So the final embedding shape is [vocab_size + 1, embedding_size].

joni115 · February 27, 2019, 1:55pm

Thank you a lot for the quickly response. However, since the index begin in 0, it supposed to be [(vocab_size-1) + 1, embedding_size]. Vocab_size-1 is because of indexation and the +1 because of the OOV token. In my case it is still one more dimension.

guillaumekln · February 27, 2019, 2:02pm

No, it’s not because of the indexation.

[vocab_size + 1, embedding_size]

here vocab_size simply refers to the number of lines in the vocabulary file, e.g. for you source file: [2151 + 1, 512]. No need to look further.

joni115 · February 27, 2019, 2:10pm

It’s true. Thank you a lot!