Add more XML tags to tokenized corpus

Aleksander · June 1, 2017, 2:56pm

I noticed that some tokenized corpora have tags that others don’t (for example some pt corpora only have “hun” and “id” tags while some es corpora have “tree”, “lem”, “id” and “svmtool”), so is it safe to assume that it’s possible to add any (coherent) tags to tokenized corpora to enhance the training process? My idea is adding something like a “context” tag, since a single word can have multiple meanings depending on the context. Would that be possible? If so, would it make a significant difference when translating?

Thanks in advance

guillaumekln · June 2, 2017, 7:34am

Of course. You can add any additional tags or, more generally, any tokens that might help the model learn its task and disambiguate part of the sequence.

Would it make a difference when translating? It depends, I would say. The model is already quite good at dealing with context on its own.
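
To make the suggestion above concrete, here is a minimal sketch of the two options mentioned (per-token tags, or extra tokens in the sequence). Everything in it is an assumption for illustration: the `|` separator, the `<ctx:...>` token format, and the function names `annotate_tokens` and `prepend_context_token` are hypothetical and do not come from any particular toolkit. Check which separator and feature format your training tool actually expects before using something like this.

```python
from typing import Sequence

# Hypothetical separator between a token and its tag; replace it with
# whatever your toolkit expects for token-level annotations.
SEP = "|"


def annotate_tokens(tokens: Sequence[str], tags: Sequence[str], sep: str = SEP) -> str:
    """Attach one tag to each token, e.g. 'bank' + 'finance' -> 'bank|finance'."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return " ".join(f"{tok}{sep}{tag}" for tok, tag in zip(tokens, tags))


def prepend_context_token(tokens: Sequence[str], context: str) -> str:
    """Add a single sentence-level pseudo-token in front of the sequence."""
    return " ".join([f"<ctx:{context}>", *tokens])


if __name__ == "__main__":
    sentence = ["the", "bank", "raised", "rates"]
    # Per-token tags: "O" is a made-up placeholder for "no specific context".
    print(annotate_tokens(sentence, ["O", "finance", "O", "finance"]))
    # Sentence-level context as an extra token.
    print(prepend_context_token(sentence, "finance"))
```

Whether either variant helps is an empirical question: as noted above, the model already picks up a lot of context by itself, so it is worth comparing against a baseline trained without the extra annotations.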