My model tries to translate:
Barack_Obama president United_States
to:
Barack_Obama is the president of the United_States
where both Barack_Obama and United_States are entities I know in advance.
I would like my model to learn to copy the names rather than generate them from the output distribution, so that if my validation data contains:
Theresa_May president United_Kingdom
it still knows how to translate it by simply “copying” the entities.
My plan is to remove all entities from the vocabulary, so that the model sees:
UNK president UNK
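To make the intent concrete, here is a rough, framework-agnostic sketch of the preprocessing I have in mind. The `ENTITIES` set and the `mask_entities` and `build_vocab` helpers are hypothetical names made up for illustration, not part of any library:

```python
# Hypothetical set of known entity tokens (an assumption for illustration).
ENTITIES = {"Barack_Obama", "United_States", "Theresa_May", "United_Kingdom"}
UNK = "UNK"


def mask_entities(tokens, entities=ENTITIES, unk=UNK):
    """Replace every known entity token with the UNK symbol."""
    return [unk if tok in entities else tok for tok in tokens]


def build_vocab(sentences, entities=ENTITIES, unk=UNK):
    """Build a token-to-id map that contains UNK but none of the entities."""
    vocab = {unk: 0}
    for sentence in sentences:
        for tok in mask_entities(sentence.split(), entities, unk):
            # Assign the next free id only if the token is new.
            vocab.setdefault(tok, len(vocab))
    return vocab


source = "Barack_Obama president United_States"
target = "Barack_Obama is the president of the United_States"

masked_source = mask_entities(source.split())
masked_target = mask_entities(target.split())
src_vocab = build_vocab([source])
tgt_vocab = build_vocab([target])

print(" ".join(masked_source))  # UNK president UNK
print(" ".join(masked_target))  # UNK is the president of the UNK
```

This only masks the text before it reaches the model; it does not by itself give the model a way to copy the original names back into the output.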
How can I remove a known set of words from both the input and output vocabularies?
Bonus:
UNK president UNK
is not very informative. Is there a way to incorporate a character embedding alongside the word embedding? That way, while hopefully still copying the entities, the model would know to translate these two cases differently:
of England
of the United_Kingdom
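As a sketch of the input I imagine for the bonus part: each token would carry a word id (UNK for entities) plus a padded sequence of character ids that a character embedding layer could consume. This is plain Python with no model attached; `encode`, `build_char_vocab`, the `<pad>` symbol and `max_chars` are hypothetical names and choices of mine:

```python
# Hypothetical entity list and special symbols (assumptions for illustration).
ENTITIES = {"Theresa_May", "England", "United_Kingdom"}
UNK, PAD = "UNK", "<pad>"


def build_char_vocab(sentences):
    """Map every character seen in the sentences to an id; 0 is padding."""
    vocab = {PAD: 0}
    chars = {ch for s in sentences for tok in s.split() for ch in tok}
    for ch in sorted(chars):
        vocab[ch] = len(vocab)
    return vocab


def encode(sentence, word_vocab, char_vocab, max_chars=16):
    """Return word ids (entities become UNK) and padded character ids per token."""
    word_ids, char_ids = [], []
    for tok in sentence.split():
        word = UNK if tok in ENTITIES else tok
        word_ids.append(word_vocab.get(word, word_vocab[UNK]))
        # Character ids are taken from the original token, not from "UNK".
        ids = [char_vocab[ch] for ch in tok[:max_chars]]
        char_ids.append(ids + [char_vocab[PAD]] * (max_chars - len(ids)))
    return word_ids, char_ids


a = "Theresa_May president England"
b = "Theresa_May president United_Kingdom"
word_vocab = {UNK: 0, "president": 1}
char_vocab = build_char_vocab([a, b])

words_a, chars_a = encode(a, word_vocab, char_vocab)
words_b, chars_b = encode(b, word_vocab, char_vocab)

print(words_a == words_b)        # True: both look like "UNK president UNK"
print(chars_a[2] == chars_b[2])  # False: the characters still tell them apart
```

The word-level view of both sentences is identical, while the character-level view differs, which is the extra signal I hope the model could use.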