Recommendations for breaking down input text for inference

I have an en-es model trained on single sentences and would like to be able to run inference on larger blocks of text. Any recommendations for how to do this? For this model and other European languages it seems reasonable to split on periods and then translate each sentence independently. However, in the future I’d like to support other languages (Chinese and Arabic) which don’t necessarily use periods as sentence terminators. Additionally, I’d like to reasonably handle European-language text with non-standard punctuation in poetry, lyrics, etc.
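For concreteness, here is a minimal sketch of the naive period-splitting approach I have in mind; `translate()` is just a placeholder for the model’s single-sentence inference call:

```python
import re

def translate(sentence: str) -> str:
    """Placeholder for the en-es model's single-sentence inference call."""
    raise NotImplementedError

def translate_block(text: str) -> str:
    # Naive segmentation: split after '.', '!' or '?' followed by whitespace.
    # Roughly workable for European-language prose, but it mishandles
    # abbreviations ("Dr. Smith"), scripts with other terminators
    # (e.g. Chinese 。), and unpunctuated text like poetry or lyrics.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(translate(s) for s in sentences if s)
```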

I’m wondering whether there is a standard approach for this.

Not sure if this is standard, but NLTK has a module that can train a sentence segmenter:

https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt
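Rough sketch of how it might be used, either with the pretrained Punkt models NLTK ships or trained on your own corpus. The download step, the corpus path, and the sample strings are just placeholders, and the resource name can vary between NLTK versions:

```python
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Option 1: pretrained Punkt models bundled with NLTK
# (one-time download; resource name may differ by NLTK version).
nltk.download("punkt")
sentences = sent_tokenize(
    "El Dr. García llegó tarde. ¿Dónde estabas? Ven mañana.",
    language="spanish",
)

# Option 2: train Punkt on your own unlabeled corpus, useful when the
# pretrained models don't fit the language or domain.
with open("my_corpus.txt", encoding="utf-8") as f:  # placeholder path
    corpus_text = f.read()

trainer = PunktTrainer()
trainer.train(corpus_text)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

for sentence in tokenizer.tokenize("Some block of text to translate."):
    print(sentence)  # feed each sentence to the translation model here
```

Note that Punkt is unsupervised, so it only needs raw text in the target language to learn abbreviation and boundary statistics.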


Thanks! I’ll check it out.