Multi-way embeddings?

Etienne38 · January 31, 2017, 10:31am

You mean the dot is also automatically added by ONMT after the PREFIX?
PREFIX.source_feature_1.dict

Why didn’t it produce new feature dict files? That was the case previously, without the “-features_vocabs_prefix” option.

guillaumekln · January 31, 2017, 10:35am

Yes, the dot is added automatically, and no feature dictionaries were produced because -features_vocabs_prefix was set.

I will make it raise an error if -features_vocabs_prefix is used and no files are found.
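
For readers following along, here is a minimal Python sketch of the naming convention and the planned check. It is not the ONMT code (which is Lua). The function names are hypothetical, and the `target` side name is an assumption, since only the `PREFIX.source_feature_1.dict` pattern is confirmed above.

```python
import os


def feature_dict_paths(prefix, side, num_features):
    """Build the expected dictionary file names for one side.

    The dot after the prefix is added here, so the caller passes
    'PREFIX' and not 'PREFIX.'. `side` is 'source' (confirmed in the
    thread) or 'target' (assumed to follow the same pattern).
    """
    return [
        f"{prefix}.{side}_feature_{i}.dict"
        for i in range(1, num_features + 1)
    ]


def load_feature_dicts(prefix, side, num_features):
    """Return the dictionary paths, failing loudly if any is missing.

    This mirrors the behaviour discussed above: when a prefix is given
    but the files are not found, stop with an error instead of silently
    continuing without the pre-built dictionaries.
    """
    paths = feature_dict_paths(prefix, side, num_features)
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(
            "feature dictionaries not found: " + ", ".join(missing)
        )
    return paths
```

With a prefix of `PREFIX` and two source features, the sketch looks for `PREFIX.source_feature_1.dict` and `PREFIX.source_feature_2.dict`.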

Etienne38 · January 31, 2017, 10:55am

In my data, I used 2 features. The first has very few values (like case_feature), and the second is limited to 50000 values in the files. I was expecting ONMT to also use 2 dicts with these same properties.

But, looking at the output, it seems to build a single dict for all of them?

guillaumekln · January 31, 2017, 11:00am

Vocabulary related logs will be improved as part of this PR:

https://github.com/OpenNMT/OpenNMT/pull/95

Currently, you can check how many dictionaries were actually built at the end of preprocessing, when they are saved.

Etienne38 · January 31, 2017, 11:08am

As I said, no new dict is saved after preprocessing (certainly because of the “-features_vocabs_prefix” option).

In fact, my question about the number of dicts used (in the experiment described) was just for my information. Using the right prefix (knowing that a dot is added) now makes it work as I expected, with my 2 pre-built feature dicts for each language. The log now clearly reports that they are loaded.

guillaumekln · January 31, 2017, 3:08pm

I merged what we discussed:

https://github.com/OpenNMT/OpenNMT/commit/0e4fe692f98cc5930a5754ebb9bbf0f4052a9f13

I hope this suits your needs better.

oraveczcsaba · February 14, 2017, 8:12am

Sorry for a slightly off-topic question, but is there any special reason why the feature separator is the halfwidth light vertical (U+FFE8) instead of the simple vertical line (U+007C)?

guillaumekln · February 14, 2017, 8:42am

Simply because texts are more likely to contain the simple vertical line. For the sake of simplicity, there is no escaping mechanism, so we want to minimize the likelihood of finding the actual separator in the text itself.
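
To illustrate the trade-off, here is a small Python sketch of how a token with features can be split on U+FFE8. It is not the ONMT implementation. The helper names are hypothetical, and the guard in `join_token` is an addition for the example, reflecting the fact that there is no escaping mechanism.

```python
# Halfwidth light vertical, the feature separator discussed above.
FEATURE_SEPARATOR = "\uffe8"


def split_token(token):
    """Split 'word<sep>feat1<sep>feat2' into ('word', ['feat1', 'feat2']).

    A plain vertical line (U+007C) inside the word is left untouched,
    because it is not the separator.
    """
    parts = token.split(FEATURE_SEPARATOR)
    return parts[0], parts[1:]


def join_token(word, features):
    """Attach features to a word.

    With no escaping mechanism, a separator inside the word or a
    feature value would be misread later, so it is rejected here.
    """
    fields = [word, *features]
    for field in fields:
        if FEATURE_SEPARATOR in field:
            raise ValueError(f"field contains the feature separator: {field!r}")
    return FEATURE_SEPARATOR.join(fields)
```

A word such as `a|b` passes through unchanged, whereas it would have been split in two if U+007C were the separator.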

Etienne38 · February 24, 2017, 3:14pm

I’m experimenting with the new option to set the vocab and embedding size for each feature. It would be nice if the log reported the settings in use at startup:

With automatic embedding sizes, it would tell me which size is used.
With individual settings in the parameters, it would confirm that everything is properly taken into account, so that the values used are really the specified ones.

guillaumekln · February 24, 2017, 5:21pm

https://github.com/OpenNMT/OpenNMT/commit/72187f6a1cb4658eca2acaa11eaa1c0908cd6368