Remove words from vocabularies

AmitMY · September 1, 2018, 9:21am

My model tries to translate:

Barack_Obama president United_States

to:

Barack_Obama is the president of the United_States

Both Barack_Obama and United_States are entities that I know in advance.

I would like my model to learn to copy the names instead of generating them from the output distribution, so that if my validation data contains:

Theresa_May president United_Kingdom

it will still know how to translate it by simply “copying” the entities.

The way I want to do that is to remove all entities from the vocabulary, so that the model sees:

UNK president UNK

How can I remove a known set of words from both the input and output vocabularies?
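
To make the intended preprocessing concrete, here is a minimal sketch in plain Python, not tied to any particular toolkit. The helper names `remove_entities` and `apply_vocab` are hypothetical, and the vocabularies are toy lists built from the example above. It only illustrates the effect I am after: entity tokens are dropped from both vocabularies, so they appear as UNK on both sides.

```python
UNK = "UNK"

def remove_entities(vocab, entities):
    """Return the vocabulary without the entity tokens, keeping the original order."""
    entities = set(entities)
    return [word for word in vocab if word not in entities]

def apply_vocab(tokens, vocab):
    """Replace every token that is not in the vocabulary with UNK."""
    known = set(vocab)
    return [tok if tok in known else UNK for tok in tokens]

# Toy data taken from the example in this post (assumed, not a real dataset).
entities = {"Barack_Obama", "United_States"}
src_vocab = ["Barack_Obama", "president", "United_States"]
tgt_vocab = ["Barack_Obama", "is", "the", "president", "of", "United_States"]

# Drop the known entities from both the input and the output vocabulary.
src_vocab = remove_entities(src_vocab, entities)
tgt_vocab = remove_entities(tgt_vocab, entities)

src_seen = apply_vocab("Barack_Obama president United_States".split(), src_vocab)
tgt_seen = apply_vocab(
    "Barack_Obama is the president of the United_States".split(), tgt_vocab
)

print(" ".join(src_seen))  # UNK president UNK
print(" ".join(tgt_seen))  # UNK is the president of the UNK
```

An unseen entity such as Theresa_May ends up as UNK in the same way, which is the behaviour I want at validation time.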

Bonus:

UNK president UNK

is not very informative. Is there a way to incorporate a character embedding alongside the word embedding? That way, while hopefully still copying the entities, the model would know to translate distinctly:

of England
of the United_Kingdom
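
As a rough illustration of what I mean by combining the two embeddings, here is a small sketch in plain Python. Everything in it is an assumption made for illustration: the vectors are random rather than learned, the dimensions are arbitrary, and averaging character vectors stands in for whatever character encoder a real model would use. The point is only that two entities sharing the UNK word vector still get different combined representations.

```python
import random

random.seed(0)
WORD_DIM, CHAR_DIM = 4, 3  # arbitrary sizes chosen for the sketch
UNK = "UNK"

# Word embeddings for the reduced vocabulary (entities already removed).
word_vocab = [UNK, "president", "of", "the"]
word_emb = {w: [random.uniform(-1, 1) for _ in range(WORD_DIM)] for w in word_vocab}

# Character embeddings, created lazily the first time a character is seen.
char_emb = {}

def char_vector(token):
    """Average the character embeddings of a non-empty token."""
    vecs = []
    for ch in token:
        if ch not in char_emb:
            char_emb[ch] = [random.uniform(-1, 1) for _ in range(CHAR_DIM)]
        vecs.append(char_emb[ch])
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def embed(token):
    """Concatenate the word vector (UNK if out of vocabulary) with the character vector."""
    word_part = word_emb.get(token, word_emb[UNK])
    return word_part + char_vector(token)

england = embed("England")
united_kingdom = embed("United_Kingdom")

# Both entities share the UNK word part but differ in the character part.
print(england[:WORD_DIM] == united_kingdom[:WORD_DIM])
print(england[WORD_DIM:] == united_kingdom[WORD_DIM:])
```

With an input like this, the decoder would at least have a signal for choosing between “of England” and “of the United_Kingdom”, even though both entities are UNK at the word level.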