I have a translation model trained mostly on single sentences of fewer than 50 tokens.
For translation, I have added length and coverage penalties to the beam search. This helps when the input consists of several short sentences. However, I was wondering if there is an optimal way to translate long sentences or paragraphs (longer than 50 tokens). I have seen discussion in this forum about using sentence splitters like NLTK's PunktSentenceTokenizer. But this is not useful if a single sentence is more than 50 tokens long or doesn't include proper punctuation.
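To make the problem concrete, the naive fallback I'm considering is: split on sentence-final punctuation where it exists, then cut any segment that still exceeds the 50-token limit into overlapping fixed-size token windows. A minimal sketch (the 50-token limit matches my training data; the overlap of 5 tokens is just a guess, and whitespace tokenization stands in for whatever the real model uses):

```python
import re

def chunk_long_input(text, max_tokens=50, overlap=5):
    """Split text into segments of at most max_tokens whitespace tokens.

    First split on sentence-final punctuation; any segment still longer
    than max_tokens is cut into overlapping fixed-size token windows.
    """
    # Rough sentence split on ., !, or ? followed by whitespace.
    segments = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for seg in segments:
        tokens = seg.split()
        if not tokens:
            continue
        if len(tokens) <= max_tokens:
            chunks.append(seg)
            continue
        # Over-long segment: slide a max_tokens window with a small overlap
        # so context isn't cut completely dead at each boundary.
        step = max_tokens - overlap
        for start in range(0, len(tokens), step):
            chunks.append(' '.join(tokens[start:start + max_tokens]))
            if start + max_tokens >= len(tokens):
                break
    return chunks
```

Each chunk would then be translated independently and the outputs concatenated, which is obviously lossy at the chunk boundaries; that loss of cross-boundary context is exactly what I'd like a better approach for.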
Do you know what the industry standard is for handling such cases?