Options to translate long sentences or paragraphs

Hi all,

I have a translation model trained mostly on single sentences of fewer than 50 tokens.

For translation, I have added length and coverage penalties to the beam search. This helps when translating several short sentences together. However, I was wondering if there is an optimal way to translate long sentences or paragraphs (longer than 50 tokens). I have seen discussion in this forum about using sentence splitters like NLTK PunktSentenceTokenizer. But this is not useful if a single sentence is more than 50 tokens long or doesn’t include proper punctuation.

Do you know what the industry standard is for handling such cases?

Thanks.

Hi,

To minimize information loss in the translation, you usually need to split the sentence into parts no longer than the maximum length seen during training, translate the parts separately, and finally merge the translations. The final result may not be very good, but at least you don’t drop large portions of the input.
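
The split, translate, merge steps can be sketched roughly as follows. This is only a minimal illustration, not a reference implementation: `translate_fn` is a hypothetical placeholder for whatever function calls your model on one token list, the limit of 50 comes from the question above, and the split is a naive fixed-size cut that ignores phrase boundaries.

```python
from typing import Callable, List


def split_tokens(tokens: List[str], max_len: int = 50) -> List[List[str]]:
    """Cut a token list into consecutive parts of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def translate_long(
    tokens: List[str],
    translate_fn: Callable[[List[str]], List[str]],  # hypothetical model call
    max_len: int = 50,
) -> List[str]:
    """Translate each part separately, then concatenate the outputs in order."""
    parts = split_tokens(tokens, max_len)
    translated_parts = [translate_fn(part) for part in parts]
    # Merge: plain concatenation, no smoothing at the part boundaries.
    return [tok for part in translated_parts for tok in part]
```

With a 120-token input and `max_len=50`, this produces parts of 50, 50 and 20 tokens. Cutting at a comma or conjunction near the limit, when one exists, would probably give better boundaries than a fixed-size cut.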

To improve the result, you can apply the same splitting to your training data so that your model also sees sentence parts in addition to complete sentences. You could also mark such split sentences with a custom token so the model learns that they are somewhat different.
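
As a rough sketch of the marking idea on the source side, assuming a made-up marker token `<part>` (it would have to be in your vocabulary) and the 50-token limit from the question: sentences that fit are left untouched, and longer ones are cut into parts that each start with the marker. Splitting the target side of the training data consistently is a separate problem (it likely needs some form of word alignment) and is not covered here.

```python
from typing import List

PART_TOKEN = "<part>"  # hypothetical marker token, not an existing convention


def split_and_mark(
    tokens: List[str], max_len: int = 50, marker: str = PART_TOKEN
) -> List[List[str]]:
    """Return the sentence unchanged if it fits, else marked parts.

    Each part is at most max_len tokens long, marker included.
    """
    if max_len < 2:
        raise ValueError("max_len must leave room for the marker and one token")
    if len(tokens) <= max_len:
        return [list(tokens)]
    budget = max_len - 1  # reserve one position for the marker
    return [
        [marker] + tokens[i:i + budget]
        for i in range(0, len(tokens), budget)
    ]
```

The same function would be applied both when preparing training data and at inference time, so that the model sees the marker in the same situations in both cases.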


@guillaumekln, thanks!