To add (a bit late) to this thread: yes, entity normalisation is important, and even if you cannot expect a jump in your score, it will make a big difference for your users. To deal with this, we have introduced monolingual preprocessing hooks and protected sequences.
In short, the process is the following:
- define a preprocessing hook that locates your favorite entities and annotates them with protected sequence markers - typically, for a URL, transform:

check out http://myurl.com/1234!

into:

check out ｟URL：http://myurl.com/1234｠!
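As a minimal sketch of such a preprocessing step (this is plain Python, not the actual OpenNMT hook API - the function name and the URL regex are my own), you could wrap every URL in the protected-sequence markers like this:

```python
import re

# Hypothetical preprocessing step (not the real OpenNMT hook signature):
# wrap every URL in the tokenizer's protected-sequence markers ｟ ｠,
# with the fullwidth colon ： separating the entity name from the value.
def protect_urls(line):
    return re.sub(
        r"https?://[\w./-]+",
        lambda m: "\uff5f" + "URL" + "\uff1a" + m.group(0) + "\uff60",
        line,
    )

# protect_urls("check out http://myurl.com/1234!")
# gives: check out ｟URL：http://myurl.com/1234｠!
```

A real hook would do the same annotation on each source line before tokenization; the point is only to show where the markers and the separator go.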
Note that there are 2 fields in the protected sequence, separated by this unusual ： character (it is not a regular ASCII colon but the fullwidth colon, U+FF1A):
- the entity name
- the actual value
This notation automatically turns the entity into a single ｟URL｠ vocabulary entry, while the second field (the actual value) is used during detokenization at inference time to substitute the actual value back in.
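The substitution at detokenization time can be sketched as follows - again a hypothetical illustration with invented helper names, not OpenNMT's internal code: collect the value field of each protected sequence from the source, then replace each ｟URL｠ placeholder in the output with it.

```python
import re

MARK_OPEN, MARK_CLOSE = "\uff5f", "\uff60"  # ｟ ｠ protected-sequence markers
SEP = "\uff1a"                              # fullwidth colon between name and value

def extract_values(source):
    """Collect (entity name, value) pairs from protected sequences in the source."""
    pattern = re.escape(MARK_OPEN) + r"(\w+)" + SEP + r"(.*?)" + re.escape(MARK_CLOSE)
    return re.findall(pattern, source)

def substitute(target, values):
    """Replace each ｟NAME｠ placeholder in the translated output with its source value."""
    for name, value in values:
        target = target.replace(MARK_OPEN + name + MARK_CLOSE, value, 1)
    return target
```

For example, `extract_values("check out ｟URL：http://myurl.com/1234｠!")` yields the pair to inject back wherever the model emitted the ｟URL｠ token in its output.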
Of course, you can also perform the preprocessing outside of the OpenNMT code (i.e. without a hook), but defining it as a hook guarantees that inference and training are identical, and you don't need to add an additional preprocessing layer in the inference code.
Check New `hook` mechanism for more details on hooks, and http://opennmt.net/OpenNMT/tools/tokenization/ for more details on protected sequences!