How to exclude numbers and URLs from the vocabulary in translation?

I would like to treat URLs and numbers (e.g. phone numbers) as special tokens in the vocabulary list. During translation, the number or URL would be copied from the source to the corresponding place in the target sentence.

How should I do that? Has this been implemented yet? If not, where should I start?

Input: I have 2 cat and 3 dog. Please call me at 823123532 if you see them.
Expected output: Tôi có 2 con mèo và 3 con chó. Vui lòng gọi cho tôi theo số 823123532 nếu bạn thấy chúng.

My situation now is that the translation engine tries to put numbers and URLs into the vocabulary list, so I sometimes see weird numbers/URLs in my target sentences.


This has been discussed several times, see for example:

Short answer:

  1. Preprocess the data by replacing entities with placeholders.
  2. Postprocess the translation to inject the original values using some sort of alignment.
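The two steps above can be sketched as follows (numbers only, for brevity; the function names are illustrative, and the same idea extends to URLs):

```python
import re

def mask_entities(text):
    """Step 1: replace numbers with a placeholder, remembering the originals."""
    values = re.findall(r'\d+', text)        # capture original values, in order
    masked = re.sub(r'\d+', '_num_', text)   # mask them in the source
    return masked, values

def unmask_entities(text, values):
    """Step 2: inject the original values back into the translation, left to right."""
    for v in values:
        text = text.replace('_num_', v, 1)
    return text

masked, values = mask_entities("I have 2 cats, call 823123532.")
# masked == "I have _num_ cats, call _num_."
restored = unmask_entities("Toi co _num_ con meo, goi _num_.", values)
# restored == "Toi co 2 con meo, goi 823123532."
```

Note that this relies on the placeholders surviving translation in the same order as in the source, which is exactly the alignment problem step 2 glosses over.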

Here is my approach:
I replace numbers with _num_ and URLs with _url_, and save the original values to lists.
After translation, I put the numbers and URLs back from those lists.

Please review whether my code works correctly and whether it has any performance problems.

import re
from nltk.tokenize import word_tokenize

def preprocess(text):
    """Replace URLs and numbers with _url_ and _num_ placeholders.

    Returns the masked text plus the lists of original values, in order.
    """
    text = text.lower()

    num_re = r'[-+]?\d*\.\d+|\d+'
    WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""

    # Mask URLs first, so digits inside a URL are not also matched by num_re.
    # finditer/group(0) is needed because the URL regex contains capture
    # groups, which make re.findall return tuples instead of full matches.
    _url_list = [m.group(0) for m in re.finditer(WEB_URL_REGEX, text)]
    result = re.sub(WEB_URL_REGEX, " _url_ ", text)

    _num_list = re.findall(num_re, result)
    result = re.sub(num_re, " _num_ ", result)

    toks = word_tokenize(result)

    return " ".join(toks), _num_list, _url_list

def postprocess(text, _num_list, _url_list):
    """Replace _num_ and _url_ placeholders with the original values."""
    text_tok = text.split(" ")
    for i in range(len(text_tok)):
        if text_tok[i] == "_num_":
            if _num_list:  # make sure an empty list doesn't crash
                text_tok[i] = str(_num_list.pop(0))  # pop first element
        elif text_tok[i] == "_url_":
            if _url_list:
                text_tok[i] = str(_url_list.pop(0))  # pop first element

    return " ".join(text_tok)

Hello @ttpro1195, the shortcoming of your approach is that you substitute the entities in the target in the same order as they appear in the source, which breaks whenever the translation reorders them.
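To make the ordering problem concrete: one common workaround is indexed placeholders (_num_0_, _num_1_, ...), so each tag identifies its own entity and the target can reorder them freely. A minimal sketch (the identifiers are illustrative, not an OpenNMT API):

```python
import re

def mask_indexed(text):
    """Mask numbers with indexed placeholders so target word order no longer matters."""
    values = {}
    def repl(m):
        tag = f"_num_{len(values)}_"
        values[tag] = m.group(0)  # remember which value this tag stands for
        return tag
    return re.sub(r'\d+', repl, text), values

def unmask_indexed(text, values):
    """Each tag maps back to exactly one original value, in any order."""
    for tag, v in values.items():
        text = text.replace(tag, v)
    return text

masked, values = mask_indexed("Call 823123532 about the 2 cats.")
# masked == "Call _num_0_ about the _num_1_ cats."
# Even if the translation reorders the tags, unmasking stays correct:
reordered = "Ve _num_1_ con meo, goi _num_0_."
# unmask_indexed(reordered, values) == "Ve 2 con meo, goi 823123532."
```

The downside is that every index becomes a distinct token the model must learn, so in practice the indices are usually capped at a small number per sentence.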

You should rather use protected sequences and integrate them with hooks, as detailed here:

Also, look at the lexical constraints here, which make sure you cannot lose entities or generate them from scratch.


What is the best way to do named-entity preprocessing and postprocessing in OpenNMT-pytorch? I understand hooks are a Lua-only feature for now?