Multi-way embeddings?

You mean the dot is also automatically added by ONMT after the PREFIX?
PREFIX.source_feature_1.dict

Why didn’t it produce new feature dict files? It did previously, without the “-features_vocabs_prefix” option.

Yes, the dot is added automatically, and no feature dictionaries were produced because -features_vocabs_prefix was set…

I will just raise an error if -features_vocabs_prefix is used and no files are found. :wink:
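To make the naming rule concrete, here is a minimal Python sketch of the lookup discussed above. It is not ONMT's actual code: the only pattern confirmed in this thread is PREFIX.source_feature_1.dict, so the `side` argument, the helper names, and the `max_features` cap are assumptions made for illustration.

```python
from pathlib import Path
from typing import List


def feature_dict_path(prefix: str, side: str, index: int) -> Path:
    """Build the expected dictionary path for one feature.

    The dot after the prefix is added here, so callers pass "PREFIX",
    not "PREFIX.".
    """
    return Path(f"{prefix}.{side}_feature_{index}.dict")


def find_feature_dicts(prefix: str, side: str, max_features: int = 16) -> List[Path]:
    """Collect consecutive feature dictionaries starting at index 1.

    Raises FileNotFoundError when none exist, instead of silently
    carrying on without feature dictionaries.
    """
    found = []
    for index in range(1, max_features + 1):
        path = feature_dict_path(prefix, side, index)
        if not path.is_file():
            break
        found.append(path)
    if not found:
        first = feature_dict_path(prefix, side, 1)
        raise FileNotFoundError(
            f"no feature dictionary found for prefix {prefix!r} (expected {first})"
        )
    return found
```

With this kind of check, a prefix with a trailing dot would look for "PREFIX..source_feature_1.dict" and fail loudly rather than skip the dictionaries.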


In my data, I used 2 features. The first has very few values (like case_feature), and the second is reduced to 50000 values in the files. I was expecting ONMT to also use 2 dicts with these same properties.

But, looking at the output, it seems to build a single dict for all of them?

Vocabulary-related logs will be improved as part of this PR:

https://github.com/OpenNMT/OpenNMT/pull/95

Currently, you can check how many dictionaries were actually built at the end of preprocessing, when they are saved.

As I said, no new dict is saved after preprocessing (presumably because of the “-features_vocabs_prefix” option).

In fact, my question about the number of dicts used (in the experiment described) was just for my information: using the right prefix (knowing that a dot is appended) now makes it work as I expected, with my 2 pre-built feature dicts for each language. The log now clearly reports that they are loaded.


I merged what we discussed:

https://github.com/OpenNMT/OpenNMT/commit/0e4fe692f98cc5930a5754ebb9bbf0f4052a9f13

I hope this suits your needs better.


Sorry for a slightly off-topic question, but is there any special reason why the feature separator is the halfwidth forms light vertical (U+FFE8) instead of the simple vertical line (U+007C)?

Simply because texts are more likely to contain the simple vertical line. For the sake of simplicity, there is no escaping mechanism, so we want to minimize the likelihood of finding the actual separator in the text itself.
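As an illustration of why the rarer character helps, here is a small Python sketch of splitting and joining annotated tokens on U+FFE8. It is only a sketch, not ONMT's implementation: the function names are hypothetical, and the check in `join_token` is an assumption about how one might guard data when no escaping exists.

```python
# U+FFE8 HALFWIDTH FORMS LIGHT VERTICAL, the feature separator discussed above.
FEATURE_SEPARATOR = "\uffe8"


def split_token(token):
    """Split 'word<sep>feat1<sep>feat2' into ('word', ['feat1', 'feat2'])."""
    word, *features = token.split(FEATURE_SEPARATOR)
    return word, features


def join_token(word, features):
    """Attach features to a word, rejecting parts that contain the separator.

    There is no escaping mechanism, so a separator inside the text itself
    would be misread as a feature boundary.
    """
    for part in (word, *features):
        if FEATURE_SEPARATOR in part:
            raise ValueError(f"separator found inside {part!r}")
    return FEATURE_SEPARATOR.join([word, *features])


# A plain vertical line (U+007C) in the text is left alone.
word, features = split_token(join_token("a|b", ["L"]))
assert (word, features) == ("a|b", ["L"])
```

Had the separator been U+007C, the token "a|b" above would have been split into a word and a bogus feature.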

I’m just experimenting with this new possibility to set the vocab and embedding size for each feature. It would be nice if the log reported the settings in use at startup.

  • with automatic embedding sizes, it would tell me which size is used.
  • with individual settings passed as parameters, it would confirm that everything is properly taken into account, so that the values used really are the specified ones

https://github.com/OpenNMT/OpenNMT/commit/72187f6a1cb4658eca2acaa11eaa1c0908cd6368
