OpenNMT Forum

How to split a large paragraph into a list of sentences?


I’m trying to translate a large paragraph with my model. The thing is, when I pass it in as a list of sentences, the translation accuracy is considerably higher than when I translate the entire paragraph at once.

Is there any way to use OpenNMT/Tokenizer to get the sentence boundaries of an input paragraph? Or is there some other way to split a paragraph into a list of sentences?


(Tomas V) #2

I ran into the same issue with my LibreOffice translation extension.
As a stop-gap measure, I used spaCy, which provides this function. But I would be much happier to see something for this in OpenNMT-py. My understanding is that spaCy, by default, derives sentence boundaries from its dependency parse rather than from punctuation rules alone.
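
If pulling in spaCy is not an option, a naive rule-based splitter can be sketched with the Python standard library alone. This is only an illustration, not something OpenNMT provides: the helper name `split_sentences` and the boundary rule (a `.`, `!` or `?` followed by whitespace and then an uppercase letter, digit, or quote) are my own assumptions. It will split wrongly after abbreviations such as "Dr." and it only suits languages with this punctuation and casing.

```python
import re
from typing import List

# Assumed boundary rule: sentence-final punctuation, then whitespace,
# then something that looks like the start of a new sentence.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\'])')


def split_sentences(paragraph: str) -> List[str]:
    """Naively split a paragraph into sentences (hypothetical helper)."""
    parts = _BOUNDARY.split(paragraph.strip())
    # Drop empty fragments, e.g. when the input is empty or whitespace only.
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    text = "This is the first sentence. Is this the second? Yes! It costs 3.5 dollars."
    for sentence in split_sentences(text):
        print(sentence)
```

Each sentence in the resulting list can then be translated separately and the outputs joined back together in order.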

Best regards