Hi, I would like to train a Transformer model and tag source subwords with additional information. For my first research experiment I have added POS, Named Entity and domain information (generic/in-domain).
My training data is tokenized.
Here is an example of a source sentence:
“Austria – works at hospital in Spittal / Drau , Carinthia”
And now in subwords:
['▁Austria', '▁–', '▁works', '▁at', '▁hospital', '▁in', '▁Sp', 'ittal', '▁/', '▁Dra', 'u', '▁,', '▁Carinthia']
My input features are:
- Feature 1: POS:
PROPN PUNCT VERB ADP NOUN ADP PROPN PROPN SYM PROPN PROPN PUNCT PROPN
- Feature 2: NE:
L O O O O O R R R R R O P
(L = Location, R = Organization, P = Person, O = Other)
- Feature 3: domain:
0 0 0 0 0 0 0 0 0 0 0 0 0
(0 = generic domain, 1 = in-domain segment)
Each feature sequence has exactly one tag per subword of the source segment: the segment has 13 subwords, and each of the three features has 13 tags.
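To rule out a problem in my data, the alignment and the expected tag counts can be checked outside of OpenNMT with a few lines of Python. This is only a minimal sketch: `check_alignment` and `count_tags` are hypothetical helper names I made up for this post, not part of OpenNMT-py, and the data is just the example segment from above.

```python
from collections import Counter

# The example segment in subwords, copied from above.
subwords = ["▁Austria", "▁–", "▁works", "▁at", "▁hospital", "▁in",
            "▁Sp", "ittal", "▁/", "▁Dra", "u", "▁,", "▁Carinthia"]

# One tag sequence per feature, whitespace-separated as in the training data.
features = {
    "pos": "PROPN PUNCT VERB ADP NOUN ADP PROPN PROPN SYM PROPN PROPN PUNCT PROPN".split(),
    "ne": "L O O O O O R R R R R O P".split(),
    "domain": "0 0 0 0 0 0 0 0 0 0 0 0 0".split(),
}


def check_alignment(subwords, features):
    """Hypothetical helper: True per feature if it has one tag per subword."""
    return {name: len(tags) == len(subwords) for name, tags in features.items()}


def count_tags(features):
    """Hypothetical helper: frequency of each tag value, per feature."""
    return {name: Counter(tags) for name, tags in features.items()}


alignment = check_alignment(subwords, features)
counts = count_tags(features)

print(alignment)
print(counts["ne"])
```

On the input side, every feature is aligned with the 13 subwords and "R" occurs 5 times, so these are the counts I would expect to see in the vocab file.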
First, I would like to learn the vocabulary with the onmt_build_vocab command. For this post, I built the vocabulary from the single example segment above; no other data was used.
I use the following transformations:
transforms: [sentencepiece, filtertoolong, filterfeats, inferfeats].
onmt_build_vocab runs through without any errors.
This is the output from the vocab file for feature 2:
- Why am I getting a (null) feature if every subword is properly tagged?
- There are 5 occurrences of Organization (“R”) in the training data. Why does the output file say there are only 4? The counts for other features, e.g. POS, are inconsistent as well.