These days I work part-time on improving translation models. We use regular transformer seq2seq networks with OpenNMT. This question is not about OpenNMT specifically, but it was triggered by reading its documentation. In OpenNMT, one can add features to each word, and each feature is then used to train its own embedding. For example, if you want to train a lowercased model but still want to retain casing information, you can add a casing feature that indicates whether the word was originally lowercase or not.
i￨C like￨l cookies￨l from￨l new￨C york￨C
This will create two embedding layers under the hood. One for the tokens, and one for the case features.
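To make the two input streams concrete, here is a minimal sketch of how such a line could be split into token indices and feature indices, each of which would then index its own embedding table. The helper names `split_features` and `build_vocab` are made up for illustration and are not OpenNMT's actual internals.

```python
SEP = "￨"  # feature separator as it appears in the example line

def split_features(line):
    """Split 'token￨feature' items into parallel token and feature lists."""
    tokens, feats = [], []
    for item in line.split():
        tok, feat = item.split(SEP)
        tokens.append(tok)
        feats.append(feat)
    return tokens, feats

def build_vocab(symbols):
    """Assign each distinct symbol an index in order of first appearance."""
    vocab = {}
    for s in symbols:
        vocab.setdefault(s, len(vocab))
    return vocab

line = "i￨C like￨l cookies￨l from￨l new￨C york￨C"
tokens, feats = split_features(line)

token_vocab = build_vocab(tokens)  # 6 distinct tokens in this toy line
feat_vocab = build_vocab(feats)    # only 2 distinct casing values

# Two parallel index sequences: one per embedding table (assumed setup).
token_ids = [token_vocab[t] for t in tokens]
feat_ids = [feat_vocab[f] for f in feats]
print(token_ids, feat_ids)
```

The point of the sketch is only that the feature vocabulary is tiny (two entries here) compared to the token vocabulary, which is what makes the sizing question below interesting.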
The documentation states that the default embedding size for a feature is
… set to N^feat_vec_exponent where N is the number of values the feature takes.
where the default feat_vec_exponent value is 0.7.
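As a quick numeric check of that rule, the sketch below evaluates N^feat_vec_exponent for a few values of N. The function name `default_feat_size` is hypothetical, and the result is left unrounded because the quoted documentation does not say how the value is converted to an integer size.

```python
def default_feat_size(n_values, feat_vec_exponent=0.7):
    """Raw (unrounded) size given by N ** feat_vec_exponent."""
    return n_values ** feat_vec_exponent

# A two-valued feature (such as casing) and a few larger vocabularies.
for n in (2, 10, 100):
    print(n, round(default_feat_size(n), 2))
```

Under this rule the size grows sublinearly with the number of values, so a feature with very few values ends up with a very small embedding.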
However, that means that a feature with only two values, such as the casing feature above, would get an embedding of size 1 or 2 (2^0.7 ≈ 1.6). This contrasts sharply with the language models that I know. Take BERT, for instance, which has token (30k values), segment (two values), and position (512 values) embeddings. All three share the model's hidden size (768 in BERT-base), even the segment embeddings.
My question thus ends up being: I always thought that the number of items in an embedding's vocabulary should more or less dictate the dimensionality of that embedding (as OpenNMT suggests), but BERT and its siblings do not do this. So which approach is better, and why? How can an embedding with only two possible values make sense in a 768-dimensional space?