Multi-way embeddings?

I would like to experiment with something like the paper you reference on your pages:

More precisely, I would like to provide a stream where each word W is linked with a feature F:
W1+F1 W2+F2 W3+F3 …

Is there a way to tell ONMT to use independent embeddings for each part W and F of each token W+F?

For training, I could build my own embeddings, one built from W and one built from F, covering all W+F cases of the training set, and tell ONMT to use fixed embeddings. That would work, since ONMT would find all known W+F pairs in the prepared embeddings. But it won’t work for translation: suppose a word W and a feature F are both known, but their combination isn’t in the prepared merged embeddings; ONMT will then treat the W+F pair as unknown instead of picking the right embedding for each part.
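To make the stream format concrete, here is a small sketch of how such a W+F stream could be produced. If I read the OpenNMT docs correctly, word and feature are joined with the special ‘￨’ separator in the actual input format; the separator is kept as a parameter here:

```python
# Sketch: building a "W+F" stream where each word carries one feature.
# The '￨' separator is my reading of OpenNMT's input format for word
# features; adjust if your version uses something else.
def annotate(words, feats, sep="￨"):
    assert len(words) == len(feats), "one feature per word"
    return " ".join(w + sep + f for w, f in zip(words, feats))

# e.g. words with a case feature: 'C' = capitalized, 'l' = lowercase
annotate(["The", "cat", "sleeps"], ["C", "l", "l"])
```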

Is there a simple entry point to do such a thing in ONMT?



OpenNMT already supports additional word features with independent and optimized embeddings. See:

However, providing fixed embeddings for these additional features is not supported.


OK, that’s exactly what I was looking for; I had missed it. Perhaps because your case-feature example is so simple, I didn’t realize the full range of possibilities it enables.

What is the size of each vector? If N is the size given by the word_vec_size parameter, is it N for each feature? Is there a way to define a size for each?

Should I use the feat_vec_exponent option, the feat_vec_size option, or both?

It depends on the -feat_merge policy.

  • With -feat_merge concat (the default), each feature embedding is concatenated to the word embedding, with a size that depends on the number of values the feature takes. For example, if the feature takes N values, its embedding size will be set to N ^ -feat_vec_exponent. There is no theoretical background behind this; it just produces reasonable sizes for features that take few values (< 100), which are the initial target for these additional features (case, POS, etc.).
  • With -feat_merge sum, each feature embedding has a fixed size -feat_vec_size; they are then summed together and the result is concatenated to the word embedding.
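The two sizing policies above can be sketched as follows. This is only my reading of the description (the 0.7 exponent and size 20 are the documented defaults, and the rounding is an assumption), not OpenNMT’s actual code:

```python
# Sketch of the two -feat_merge sizing policies described above.
# Defaults (exponent 0.7, fixed size 20) follow the OpenNMT options;
# the rounding of N ^ exponent is my assumption.
def feat_embedding_sizes(vocab_sizes, merge="concat",
                         feat_vec_exponent=0.7, feat_vec_size=20):
    if merge == "concat":
        # each feature gets its own size, derived from its vocabulary size
        return [max(1, round(n ** feat_vec_exponent)) for n in vocab_sizes]
    elif merge == "sum":
        # every feature embedding has the same fixed size; they are summed
        return [feat_vec_size for _ in vocab_sizes]
    raise ValueError("unknown merge policy: " + merge)

# a case feature with 6 values and a POS feature with 40 values
feat_embedding_sizes([6, 40])               # sizes grow with the vocabulary
feat_embedding_sizes([6, 40], merge="sum")  # all features get the fixed size
```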

So currently, it does not support manually setting the embedding size for each feature. Let us know your precise use case and whether this framework is limiting for you. There are few people who use additional features, and even fewer who bother tweaking the embeddings, so I don’t mind breaking these options if needed.


Clear explanation. Thanks!

I would like to use rich features, like lemmas or stems, possibly combined with others (like POS), that could produce as many items as the words themselves.

For what I want to do, the right option is certainly “concat”.

Of course, I can tune the “feat_vec_exponent” option. But it would be useful to have an explicit option to manually set the embedding size of each extra feature.


Maybe we should only have -feat_vec_size support a comma-separated list of sizes. The downside is that it is somewhat error-prone (it must match the number and order of the features).

Also note that feature vocabularies are not currently pruned (unlike word vocabularies). Should we also provide a way to limit their size? An easy way would be to reuse -src_vocab_size for each feature, just to ensure that the vocabulary sizes do not explode. What do you think?

I would prefer a comma-separated list of sizes, so that I can tune each one, knowing the possible size of its vocabulary.

For the pruning, perhaps it would be useful to also have a boolean option for each, and a size for each (?), depending on what kind of feature it is.


Why not a unified way for all items of the N-stream (including words):
-src_vocab_size 50000,0,10000
-word_vec_size 500,20,200

In this case “-src_vocab_size 0” means “do not prune”.
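The proposed convention could be parsed along these lines (a minimal sketch of the proposal above, not existing OpenNMT behaviour; the option names and the “0 means do not prune” rule are taken from this thread):

```python
# Sketch of parsing the proposed unified per-stream options, where a
# vocabulary size of 0 means "do not prune". Hypothetical, following
# the -src_vocab_size / -word_vec_size lists proposed above.
def parse_stream_options(vocab_sizes, vec_sizes):
    vocabs = [int(v) for v in vocab_sizes.split(",")]
    vecs = [int(v) for v in vec_sizes.split(",")]
    # one entry per stream (the word stream plus each feature stream)
    assert len(vocabs) == len(vecs), "lists must cover the same streams"
    return [{"prune_to": v or None, "embedding_size": e}
            for v, e in zip(vocabs, vecs)]

parse_stream_options("50000,0,10000", "500,20,200")
```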

PS: updated


I liked this specification at first, but we actually need to distinguish between source and target feature embedding sizes. I don’t think there is another way than introducing -src_feat_vec_size and -tgt_feat_vec_size options.

The problem is the same for the standard word embeddings. You would rather need this:
-src_vocab_size 50000,0,10000
-src_word_vec_size 500,20,200
-tgt_vocab_size 50000,0,10000
-tgt_word_vec_size 500,20,200



Regarding the non-breaking status: in fact, you will be obliged to break the existing “word_vec_size” option to get “src” and “tgt” versions of it.

As for what it brings: the main goal is to switch from heterogeneous word+feature streams to unified multiplexed N+M streams (N source + M target). It could open up many experimental use cases.

I kept -word_vec_size. If set to a non-zero value, it is used as the word embedding size for both the encoder and the decoder.


To be exhaustive, you also need to make these options compatible with N-stream processing:

In my case, it would be very useful. I would like to train on a mix of specific and generic data. It would be interesting to provide my own pre-built dictionaries, where I force all the specific vocabulary to be in the dictionary while only the generic vocabulary is pruned (for words, and for features).
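The dictionary-building policy I have in mind could look like this (a hypothetical sketch of my use case, not anything OpenNMT does today):

```python
# Sketch: keep ALL "specific" vocabulary, prune only the "generic" part
# by frequency until a size budget is reached. Hypothetical policy.
from collections import Counter

def build_vocab(specific_tokens, generic_tokens, generic_limit):
    # keep every specific token, deduplicated in first-seen order
    vocab = list(dict.fromkeys(specific_tokens))
    target = len(vocab) + generic_limit
    # add the most frequent generic tokens until the budget is spent
    for tok, _ in Counter(generic_tokens).most_common():
        if len(vocab) >= target:
            break
        if tok not in vocab:
            vocab.append(tok)
    return vocab
```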


Mmh, yes. However, I would not like to pass a comma-separated list of paths on the command line, as you usually rely on shell auto-completion.

You could work around that by using the -features_vocabs_prefix. You just need to name the dictionaries correctly, for example:

  • data/mydicts.source_feature_1.dict
  • data/mydicts.source_feature_2.dict
  • data/mydicts.source_feature_3.dict

and set -features_vocabs_prefix data/mydicts.
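A prefix-based lookup of that kind could be sketched as follows (the naming pattern comes from the example above; the glob logic is my assumption, not OpenNMT’s actual code):

```python
# Sketch: resolve feature dictionaries from a -features_vocabs_prefix-style
# path prefix. The "<prefix>.source_feature_<i>.dict" pattern follows the
# example above; this is not OpenNMT's actual implementation.
import glob
import os

def find_feature_vocabs(prefix, side="source"):
    pattern = f"{prefix}.{side}_feature_*.dict"
    # lexicographic sort keeps feature_1, feature_2, ... in order
    # (fine for fewer than 10 features)
    return sorted(glob.glob(pattern))
```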


I’m pre-building dictionaries, and I’m using the “-src_vocab” and “-features_vocabs_prefix” parameters to tell ONMT to use them. I wonder what “preprocess.lua” is doing while reading the files. Is it building new dictionaries from the data?

...
Reading source vocabulary from '/home/dev8/OpenNMT-master/--/train.tsv.src.dict'...
Loaded 50005 source words
Building source vocabulary...
Created dictionary of size 50004 (pruned from 203807)
...

Oh, this is a log inconsistency. It correctly reuses the source dictionary, but it displays a wrong message when building the features vocabularies (according to the output, you did not set -features_vocabs_prefix).

It will be fixed.

Why? As I said, I used both “-src_vocab” and “-features_vocabs_prefix”. No new “.dict” file was created by “preprocess.lua”. At the end of the processing, the folder contains only my own prepared dictionaries.

Currently, if the logs contain:

Building source vocabulary...
Created dictionary of size 50004 (pruned from 203807) 

it means that either the source word vocabulary is missing, or the features vocabularies, or both. So I think the path prefix given to -features_vocabs_prefix is wrong, and the script fails silently (it should be more verbose here).

Maybe my previous message formatting was ambiguous: it is -features_vocabs_prefix data/mydicts, without a trailing dot (the dot belongs to the dictionary file names, not to the prefix).