You mean the dot is also automatically added by ONMT after the PREFIX?
PREFIX.source_feature_1.dict
Why didn’t it produce new feature dict files? It did previously, without the “-features_vocabs_prefix” option.
Yes, the dot is added automatically, and no feature dictionaries were produced because -features_vocabs_prefix was set.
I will just raise an error if -features_vocabs_prefix is used and no files are found.
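For anyone following along, here is a minimal Python sketch of the behaviour described above. It is not ONMT's actual code (ONMT is written in Lua). The helper names `feature_dict_path` and `find_feature_dicts` are hypothetical, and the `target_feature_N.dict` naming is an assumption by analogy with the `source_feature_1.dict` name quoted above.

```python
import os


def feature_dict_path(prefix, side, index):
    # The dot is inserted here, so the prefix itself should not end with one.
    # Only the "source" form appears in this thread; "target" is assumed.
    return f"{prefix}.{side}_feature_{index}.dict"


def find_feature_dicts(prefix, side="source"):
    """Collect PREFIX.<side>_feature_1.dict, _2.dict, ... while they exist."""
    paths = []
    index = 1
    while os.path.isfile(feature_dict_path(prefix, side, index)):
        paths.append(feature_dict_path(prefix, side, index))
        index += 1
    if not paths:
        # Mirrors the proposed fix: fail loudly instead of silently
        # building (and not saving) new dictionaries.
        raise FileNotFoundError(
            f"no feature dictionaries found for prefix {prefix!r} ({side})"
        )
    return paths
```

With a prefix of `data/demo`, this would look for `data/demo.source_feature_1.dict`, then `data/demo.source_feature_2.dict`, and so on, which is why a prefix that already ends with a dot finds nothing.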
In my data, I used 2 features. The first has very few values (like a case feature), and the second is reduced to 50000 values in the files. I was expecting ONMT to also use 2 dicts with these same properties.
But, looking at the output, it seems to build a single dict for all of them?
Vocabulary-related logs will be improved as part of this PR:
https://github.com/OpenNMT/OpenNMT/pull/95
Currently you can check how many dictionaries were actually built at the end of the preprocessing when they are saved.
As I said, no new dict is saved after the preprocessing (presumably because of the “-features_vocabs_prefix” option).
In fact, the question about the number of dicts used (in the experiment described) is just for my information. Using the right prefix (knowing that a dot is added) now makes it work as I expected, with my 2 pre-built feature dicts for each language. The log now clearly reports their loading.
I merged what we discussed:
https://github.com/OpenNMT/OpenNMT/commit/0e4fe692f98cc5930a5754ebb9bbf0f4052a9f13
I hope this suits your needs better.
Sorry for the slightly off-topic question, but is there any special reason why the feature separator is the halfwidth light vertical (U+FFE8) instead of the simple vertical line (U+007C)?
Simply because texts are more likely to contain the simple vertical line. For the sake of simplicity, there is no escaping mechanism, so we want to minimize as much as we can the likelihood of finding the actual separator in the text itself.
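To illustrate the point, here is a small Python sketch of splitting an annotated token on the separator. It is not ONMT's own tokenizer code, and the name `split_token` is hypothetical. It assumes the word comes first, followed by its features.

```python
SEP = "\uffe8"  # halfwidth light vertical, the feature separator


def split_token(token):
    """Split 'word<SEP>feat1<SEP>feat2' into (word, [feat1, feat2])."""
    word, *features = token.split(SEP)
    return word, features


# A plain vertical line (U+007C) inside the text is left alone,
# so no escaping is needed for it.
word, features = split_token("a|b" + SEP + "L")
```

Since there is no escaping, a token that really contained U+FFE8 would be split in the wrong place, hence the choice of a character that rarely occurs in real text.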
I’m just experimenting with this new possibility to set the vocab and embedding size for each feature. It would be nice if the log reported the settings in use at startup.