Inference best practices

It would be great if there were some sort of guide on inference best practices, especially for people like me who have very little knowledge of NLP.

For example, I have noticed that when I split a paragraph into sentences and then translate each sentence separately, the translation seems better, and translation errors become localized: part of the paragraph is wrong, rather than passing the whole paragraph and getting nonsense back.
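
Here is a rough sketch of what I mean, assuming a hypothetical `translate()` wrapper around whatever inference call you use; NLTK's punkt splitter is just one way to split sentences:

```python
from nltk.tokenize import sent_tokenize  # pip install nltk; nltk.download("punkt")

def translate_paragraph(paragraph, translate):
    # Translate sentence by sentence so that a bad output stays
    # localized to one sentence instead of derailing the whole paragraph.
    sentences = sent_tokenize(paragraph)
    return " ".join(translate(sentence) for sentence in sentences)
```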

Another thing I’ve noticed: simple changes in tokenization can easily affect inference (“▁He llo” vs. “▁He llo.”). It is a little strange that adding a period would affect the translation.
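
For context, those look like SentencePiece subword pieces. A quick way to see that the period changes the exact token sequence the model receives (a sketch, assuming a trained SentencePiece model at the hypothetical path `bpe.model`; the actual pieces depend on your trained vocabulary):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # hypothetical model path
print(sp.encode("Hello", out_type=str))   # e.g. ['▁He', 'llo']
print(sp.encode("Hello.", out_type=str))  # e.g. ['▁He', 'llo', '.']
```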


Hi,

Both observations are quite normal. It all comes down to your training data.

In the first case, if your training corpus contains mostly separate sentences with few or no short paragraphs, then you can expect inference to be poor even for short paragraphs of 2-3 sentences, since they are treated as single strings. Typically, NMT models are trained at the sentence level.

The period or other punctuation marks can also make a huge difference. One common scenario is titles vs instructions. Consider this simple sentence:

Open documents

This is usually a title, as there is no period, and in many languages “Open” will be translated as a noun.

Adding a period changes the meaning completely, because now the sentence is in the imperative:

Open documents.

Here “Open” is a verb, and a model trained on both variants will learn the difference and adjust the translation accordingly.
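
A quick way to check this against your own model (a sketch; `translate` is a placeholder for your actual inference call):

```python
def translate(text):
    # Placeholder: call your NMT engine here (e.g. a translation server or API).
    raise NotImplementedError

for source in ["Open documents", "Open documents."]:
    print(source, "->", translate(source))
# If the model was trained on both forms, the outputs should differ:
# a noun phrase for the title, an imperative for the instruction.
```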
