I am thinking of a way to ensure that numbers and other named entities appearing in the source are translated correctly in the NMT output.
Do you think that training the network with named entity tags as word features, both on the source and the target side, could be a good way to obtain the desired result?
At decoding time, one could look for all tokens in the output sentence that have been tagged as numbers (or another named entity) and compare each of them with the source word that receives the highest attention score.
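As a sketch of that decoding-time check (assuming the attention matrix is available as target-by-source scores; the function and tag names here are illustrative, not part of any toolkit):

```python
def check_entities(src_tokens, tgt_tokens, tgt_tags, attention):
    """For each target token tagged as a number, find the source token it
    attends to most and report mismatches between the two surface forms.

    attention[i][j] is the attention weight of target token i on source token j.
    Returns a list of (tgt_index, tgt_token, src_index, src_token) mismatches.
    """
    mismatches = []
    for i, (tok, tag) in enumerate(zip(tgt_tokens, tgt_tags)):
        if tag != "NUM":
            continue
        # index of the source token with the highest attention score
        j = max(range(len(src_tokens)), key=lambda k: attention[i][k])
        if src_tokens[j] != tok:
            mismatches.append((i, tok, j, src_tokens[j]))
    return mismatches
```

For example, if the target token "52" is tagged NUM and attends mostly to the source token "25", the function flags it; when the numbers match, it returns an empty list.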
That approach should work. Have you already run some experiments?
I think a more robust approach would be to replace all entities with placeholders like
__ent_numeric_time, etc., use attention to align them with the source placeholders, and finally substitute back their original values. However, that obviously means you need a system that produces these placeholders, plus preprocessing and post-processing steps to replace them in both directions.
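A minimal sketch of the pre- and post-processing side of that idea, here for numbers only (the placeholder naming scheme and the regex are illustrative assumptions, not OpenNMT features; indexed placeholders sidestep the alignment step, whereas untyped placeholders would need the attention-based alignment described above):

```python
import re

# Matches integers and numbers with decimal/thousands separators, e.g. "1,234.5"
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(sentence):
    """Replace each number with an indexed placeholder; return the masked
    sentence and the list of original values, in order of appearance."""
    values = []
    def repl(match):
        values.append(match.group(0))
        return f"__ent_numeric_{len(values) - 1}"
    return NUM_RE.sub(repl, sentence), values

def unmask_numbers(sentence, values):
    """Substitute the original values back into the translated output."""
    return re.sub(r"__ent_numeric_(\d+)",
                  lambda m: values[int(m.group(1))], sentence)
```

For example, masking "it costs 25 dollars" yields "it costs __ent_numeric_0 dollars" with the value "25" stored for restoration after translation.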
Another approach is to use subword tokenization like BPE or wordpieces, but that does not completely solve the issue with numbers.
Thank you! I have not tested the approach yet. How can one get the attention information? I have seen that there is the -withAttn option for translation in server mode, but not for standard batch translation using translate.lua.
Indeed, there is no option in translate.lua to dump attention vectors, but it is easy to retrieve:
diff --git a/translate.lua b/translate.lua
index cff64fe..5f321f8 100644
@@ -127,6 +127,7 @@ local function main()
_G.logger:info("PRED %d: %s", sentId, sentence)
_G.logger:info("PRED SCORE: %.2f", results[b].preds[n].score)