Recommendations for breaking down input text for inference

I have an en-es model trained on single sentences and would like to be able to run inference on larger blocks of text. Any recommendations for how to do this? For this model and other European languages it seems reasonable to split on periods and then translate each sentence independently. However, in the future I’d like to support other languages (Chinese and Arabic) which don’t necessarily use periods as sentence terminators. Additionally, I’d like to reasonably handle European-language text with non-standard punctuation in poetry, lyrics, etc.
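For concreteness, here is a minimal sketch of the naive period-splitting approach I have in mind; `translate()` is just a placeholder for the model’s single-sentence inference call:

```python
import re

def translate(sentence: str) -> str:
    """Placeholder for the en-es model's single-sentence inference call."""
    raise NotImplementedError

def translate_block(text: str) -> str:
    # Naive segmentation: split after '.', '!' or '?' followed by whitespace.
    # Roughly workable for European-language prose, but it mishandles
    # abbreviations ("Dr. Smith"), scripts with other terminators
    # (e.g. Chinese 。), and unpunctuated text like poetry or lyrics.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(translate(s) for s in sentences if s)
```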

I’m wondering whether there is a standard approach for this.

Not sure if this is standard, but NLTK has a module that can train a sentence segmenter:

https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt
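Rough sketch of how it might be used, either with the pretrained Punkt models NLTK ships or trained on your own corpus. The download step, the corpus path, and the sample strings are just placeholders, and the resource name can vary between NLTK versions:

```python
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Option 1: pretrained Punkt models bundled with NLTK
# (one-time download; resource name may differ by NLTK version).
nltk.download("punkt")
sentences = sent_tokenize(
    "El Dr. García llegó tarde. ¿Dónde estabas? Ven mañana.",
    language="spanish",
)

# Option 2: train Punkt on your own unlabeled corpus, useful when the
# pretrained models don't fit the language or domain.
with open("my_corpus.txt", encoding="utf-8") as f:  # placeholder path
    corpus_text = f.read()

trainer = PunktTrainer()
trainer.train(corpus_text)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

for sentence in tokenizer.tokenize("Some block of text to translate."):
    print(sentence)  # feed each sentence to the translation model here
```

Note that Punkt is unsupervised, so it only needs raw text in the target language to learn abbreviation and boundary statistics.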


Thanks! I’ll check it out.