OpenNMT Forum

Sentence Boundary Detection for Non-European languages

Does anyone have recommendations for doing sentence boundary detection for Chinese and other languages that don’t use periods?

Most of the sentence boundary detection systems I’ve found use a strategy similar to “Unsupervised Multilingual Sentence Boundary Detection”: they try to distinguish the periods that end sentences from the periods in acronyms and abbreviations. DeepSegment, or something like it, seems like it might work, even though all of the pretrained models are for European languages that use periods. Is sentence boundary detection something that can be done in a general way for all languages, or does it require customization for different punctuation systems?
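
For comparison, the naive baseline for Chinese would be to split on the full-width sentence-ending marks, since 。 generally isn't also used for abbreviations the way the ASCII period is. A rough sketch of that idea follows; the function name `split_sentences` and the exact sets of terminators and closing quotes are assumptions for illustration, not taken from any of the tools mentioned above:

```python
import re

# Assumed sentence-ending marks: ideographic full stop, full-width
# exclamation and question marks, plus their ASCII counterparts.
_TERMINATORS = "。！？!?"
# Closing quotes/brackets that may follow a terminator and still
# belong to the sentence that just ended.
_CLOSERS = "」』”’）》"

# Either: text up to and including a run of terminators (plus closers),
# or: trailing text at the end of the input that has no terminator.
_SENTENCE_RE = re.compile(
    rf"[^{_TERMINATORS}]*[{_TERMINATORS}]+[{_CLOSERS}]*|[^{_TERMINATORS}]+$"
)


def split_sentences(text):
    """Split text into sentences on the assumed terminator set."""
    pieces = (m.group().strip() for m in _SENTENCE_RE.finditer(text))
    return [p for p in pieces if p]
```

This only covers punctuation-delimited text. It would not help with languages or inputs that omit sentence-ending punctuation altogether, which is part of why I'm asking whether a general approach exists.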

Thanks for any help!

Stanza seems to have quite good performance for sentence segmentation in Chinese. You can have a look:
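
For anyone who wants to try it, here is a minimal sketch of sentence splitting with Stanza's tokenizer. It assumes `stanza` is installed (`pip install stanza`) and that the `zh` models can be downloaded; the helper names `build_zh_pipeline` and `stanza_split_sentences` are made up for this example:

```python
def build_zh_pipeline():
    """Build a Stanza pipeline that only runs tokenization,
    which is the processor that also performs sentence segmentation."""
    import stanza  # third-party dependency, imported lazily

    stanza.download("zh")  # downloads the Chinese models if not already present
    return stanza.Pipeline(lang="zh", processors="tokenize")


def stanza_split_sentences(text, nlp):
    """Run the pipeline on text and return the sentence strings."""
    doc = nlp(text)
    return [sentence.text for sentence in doc.sentences]
```

You would call it as `stanza_split_sentences(your_text, build_zh_pipeline())`, building the pipeline once and reusing it, since loading the models is the slow part.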


Thanks, this looks really promising!