That approach should work. Have you already run any experiments?
I think a more robust approach would be to replace all entities with placeholders such as
__ent_numeric_time, let the attention align the target placeholders with the source ones, and finally substitute the original values back in. However, that obviously requires a system that recognizes these entities, plus pre-processing and post-processing steps to swap the placeholders in and out.
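A minimal sketch of the masking/unmasking steps, assuming a numeric-only entity recognizer (a real system would also cover dates, named entities, etc., and the names `mask_entities`/`unmask_entities` are hypothetical):

```python
import re

# Toy entity recognizer: numbers, optionally with decimal/thousands marks.
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def mask_entities(text):
    """Replace numeric entities with indexed placeholders; return the
    masked text plus a mapping used to restore the originals later."""
    mapping = {}
    def repl(match):
        key = f"__ent_num_{len(mapping)}"
        mapping[key] = match.group(0)
        return key
    return NUM_RE.sub(repl, text), mapping

def unmask_entities(text, mapping):
    """Post-processing: substitute original values back. Longest keys
    first, so __ent_num_10 is not clobbered by __ent_num_1."""
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text

masked, mapping = mask_entities("Flight 370 departs at 14:30 from gate 7.")
# masked: "Flight __ent_num_0 departs at __ent_num_1:__ent_num_2 from gate __ent_num_3."
restored = unmask_entities(masked, mapping)  # round-trips to the original
```

The masked sentence is what the translation model sees; the mapping is carried alongside and applied to the model's output, relying on the placeholders surviving translation intact.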
Another approach is to use subword tokenization such as BPE or WordPiece, but that does not completely solve the issue with numbers.
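To see why subwords only partially help with numbers, here is a toy greedy longest-match segmenter (WordPiece-style, without the `##` continuation marks) with a made-up vocabulary: two nearly identical numbers can get very different segmentations, so the model never sees numerals represented consistently.

```python
def segment(token, vocab):
    """Greedy longest-match subword segmentation. Falls back to single
    characters for pieces not in the vocabulary."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            piece = token[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

# Hypothetical subword vocabulary containing a few frequent digit pairs.
vocab = {"19", "99", "20", "00"}

segment("1999", vocab)  # -> ["19", "99"]
segment("1998", vocab)  # -> ["19", "9", "8"]
```

The year 1999 comes out as two pieces while 1998 fragments into three, even though the two numbers differ by one digit; the model still has to learn digit-level arithmetic-like behavior across inconsistent segmentations.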