Size of feature embeddings (and some digression about casing methods)

Cross-post from: https://discuss.huggingface.co/t/why-are-segment-and-position-embeddings-so-large/254

These days I am doing part-time work on improving translation models. We are working with regular transformer seq2seq networks using OpenNMT. This question is not about OpenNMT specifically, but it was triggered by going through its documentation. In OpenNMT one can add features to each word, and each feature is then used to train its own embedding. For example, if you want to train a lowercase model but still want to give the model access to casing information, you can add a case feature that indicates whether the word was lowercase or not.

i│C like│l cookies│l from│l new│C york│C

This will create two embedding layers under the hood. One for the tokens, and one for the case features.
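For illustration, here is a toy PyTorch sketch of what this amounts to (not OpenNMT’s actual code; the sizes 500 and 12 are just example values, and the vocabularies are built from this single sentence):

import torch
import torch.nn as nn

# Parse the word│feature pairs from the annotated sentence.
pairs = [tok.split("│") for tok in "i│C like│l cookies│l from│l new│C york│C".split()]
words, cases = zip(*pairs)

word_vocab = {w: i for i, w in enumerate(sorted(set(words)))}
case_vocab = {c: i for i, c in enumerate(sorted(set(cases)))}  # {'C': 0, 'l': 1}

# One embedding table for the tokens, one for the case feature.
word_emb = nn.Embedding(len(word_vocab), 500)
case_emb = nn.Embedding(len(case_vocab), 12)

word_ids = torch.tensor([word_vocab[w] for w in words])
case_ids = torch.tensor([case_vocab[c] for c in cases])

# The default merge is concatenation: 500 + 12 = 512 dimensions per position.
x = torch.cat([word_emb(word_ids), case_emb(case_ids)], dim=-1)
print(x.shape)  # torch.Size([6, 512])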

The documentation states that the default size for features is

… set to N^feat_vec_exponent where N is the number of values the feature takes.

where the default feat_vec_exponent value is 0.7.
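To make the arithmetic concrete, here is a quick sketch of that heuristic for a few value counts (the exact rounding OpenNMT applies may differ):

# N ** feat_vec_exponent with the default exponent of 0.7
for n_values in (2, 60, 512):
    print(n_values, n_values ** 0.7)
# 2 -> ~1.6, 60 -> ~17.6, 512 -> ~78.8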

However, that means that a feature with only two values would get an embedding of size 1 or 2 (2^0.7 ≈ 1.6). This contrasts sharply with the language models that I know. Take, for instance, BERT, whose token (~30k values), segment (two values), and position (512 values) embeddings all share the model’s hidden size (768 for BERT base), even the segment embeddings.

My question thus ends up being: I always thought that the number of items in an embedding should more or less dictate the size of that embedding (as OpenNMT suggests), but BERT and its siblings do not do this. So what is the best approach, and why? How does it make sense to embed only two values in such a high-dimensional space?

One motivation is that you can simply add the feature embedding to the main word embedding. When the dimensions are different, you need to concatenate them instead, which can produce some odd sizes.

The exact OpenNMT formula is a bit arbitrary, but it reflects a non-linear relation between the number of values and the embedding size. It also gives sizes that are close to what you would expect when designing the model.

I just found out Google made a similar recommendation in this blog post:

Well, the following “formula” provides a general rule of thumb about the number of embedding dimensions:

embedding_dimensions = number_of_categories**0.25
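For comparison, the same kind of back-of-the-envelope computation with Google’s exponent (purely illustrative):

# embedding_dimensions = number_of_categories ** 0.25
for n_categories in (2, 512, 30000):
    print(n_categories, round(n_categories ** 0.25))
# 2 -> 1 (~1.2), 512 -> 5 (~4.8), 30000 -> 13 (~13.2)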

If you are referring to OpenNMT-py, which I assume since you use Transformer models, you can read this at the end of this post: https://github.com/OpenNMT/OpenNMT-py/issues/1534

you can specify the feat_vec_size directly because indeed the exponent formula is not really straightforward to adjust to the rnn_size.

@guillaumekln @vince62s Thank you for your replies! My question is not about how to make things work with word features and transformers, though. We have successfully trained transformer models with word features. My question is a bit more theoretical in nature.

The models we trained use, for instance, token embeddings of size 500 and feature embeddings of size 12 to get to 512 in total. That makes sense because the default merge operation for embeddings is concat. However, this is in contrast with transformer-based language models, which typically use sum. As @guillaumekln indicates, either choice can be an upside or a downside. On the one hand, if you concat, you really preserve the features themselves in their own part of the vector; you lose that when you use sum. Sum, on the other hand, allows you to use a large embedding space for all features (with the restriction that they all have to be the same size).
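For concreteness, a minimal sketch of the two merge strategies being contrasted here (illustrative sizes, not our actual model code):

import torch
import torch.nn as nn

seq_len, n_tokens, n_cases = 6, 30000, 2
token_ids = torch.randint(0, n_tokens, (seq_len,))
case_ids = torch.randint(0, n_cases, (seq_len,))

# concat: each source keeps its own slice of the final vector (500 + 12 = 512)
tok_emb = nn.Embedding(n_tokens, 500)
feat_emb = nn.Embedding(n_cases, 12)
merged_concat = torch.cat([tok_emb(token_ids), feat_emb(case_ids)], dim=-1)  # (6, 512)

# sum: every embedding must share the model dimension,
# as with BERT's token/segment/position embeddings
tok_emb_s = nn.Embedding(n_tokens, 512)
feat_emb_s = nn.Embedding(n_cases, 512)
merged_sum = tok_emb_s(token_ids) + feat_emb_s(case_ids)  # (6, 512)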

My question, then, is what the implication of this is. Specifically, how come these tried-and-proven language models work so well even though they use, e.g., 512 dimensions to model just two values (segment embeddings), 512 dimensions for 512 positions (positional embeddings), and 512 dimensions for 30k tokens? In other words, why wouldn’t we use such a configuration in OpenNMT? We can, but why don’t we? Why isn’t it the default? So I am looking for some kind of explanation of why such an approach works for LMs but might not be ideal for MT.

I hope that my question is clear. If not, let me know.

I don’t know. At times I even wonder whether features really help, since we observed that tags perform as well as features in some instances.

Anyway, no particular reason. It just feels more natural to add a few dimensions for features that take few values rather than summing them up. But again, no tested rationale.

@vince62s Well, from our own research I can tell that concatenating the features (even with small sizes such as 6 dimensions) can lead to (small) improvements, so it does help, I would say, depending on your use case. What exactly do you mean by tags? Is this something that OpenNMT supports?

When I find the time I will rerun our experiments, but with sum instead of concat. I’ll report back and let you know whether we find any discernible differences.

re: tags / placeholders
an example here with case_markup

That’s interesting. So basically it allows you to annotate chunks of text with tags. But what does this do under the hood? I would expect the result to be the same, namely that it also just creates a feature embedding. Or does it do something else?

These “tags” do nothing. They are just additional tokens.

The intuition being that the system learns to see these tokens as special boundaries? If you have a paper about this, I’d be interested.

Think of it as punctuation, for instance. When you have some opening/closing parentheses or quotes, these are properly placed at inference.
These tags are a bit like that, with an opening and a closing tag that are properly placed on the target side based on the source.
I don’t have a paper detailing the topic in mind, though.

I understand what they are supposed to do, theoretically, but I am curious how well the system actually learns this “opening and closing” behavior. But since you said that you found that it works well, I’ll take your word for it.

That’s why I gave the example of opening and closing punctuation. Basically if you swap for instance your “(” token with “⦅_opening_⦆” and “)” token with “⦅_closing_⦆”, your model will probably already partly know how to handle such cases.

I don’t think there are really any specifics to look for in the way the model would learn this. As Guillaume said, these are just tokens that are learned to be generated in the sequence, like any other one.

By the way, I found this paper, which mentions a similar technique (calling it “inline casing”) without giving much more detail, though: https://europe.naverlabs.com/wp-content/uploads/2019/07/WMT_Robustness_NLE-2.pdf

This paper seems to go deeper into the subject: https://www.aclweb.org/anthology/2020.lrec-1.463.pdf
I’m not familiar with their tasks, but they seem to agree that “inline casing” is effective.

Thanks for the reference. I understand what you mean, but I am a little bit surprised. Yes, these are just special tokens that need to be learned, like any other. But the difference is that they imply that whatever is in between them has some special property (e.g. being cased). So the model must somehow learn that everything between these two special tokens has that property, presumably by means of attention. (So I would expect that, if you look at the self-attention weights, everything inside that span pays significant attention to the opening and closing tags.)

The model does not learn the casing per se. It’s applied as a postprocessing step (when detokenizing for instance), looking for the generated tokens.

E.g. to translate
"J'ai mangé une POMME."
we tokenize/apply the casing tags
"<U> j </U> []' []ai mangé une <U> pomme </U> []."
infer
"<U> i </U> ate an <U> apple </U> []."
postprocess
"I ate an APPLE."

("[]" represents the joiner here)

So yes, it’s just a token like any other, and so are the tokens in between.

Maybe there would be some effect in the attention weights, but nothing especially related to the feature, since the model does not know it is a feature; these are just tokens in the sequence. And I’m not sure it would specifically attend to these opening/closing tags, as the whole point of this is to be able to handle text irrespective of casing.

Ah, now I see. I thought the point was that these tokens would not be predicted. So that your source is uncased with features and that the output would be cased, leaving it up to the system to figure out when to use casing. But now I see that the casing is indeed produced as tokens and then applied in post-processing. Thanks for the clarification.

There was an experimental error in my previous post: because I ran the last experiment with more GPUs than the previous one, the results might not be comparable (see this question). I re-ran the experiment with the same number of GPUs and did not find any conclusive differences in performance. However, my gut says this might also be a matter of hyperparameter optimization. For now, we’ll stick with the default concat.