Sentence Boundary Detection for Non-European Languages

Does anyone have recommendations for doing sentence boundary detection for Chinese and other languages that don’t use periods?

Most of the sentence boundary detection systems I’ve found use a strategy similar to “Unsupervised Multilingual Sentence Boundary Detection”: they try to distinguish the periods that separate sentences from the periods in acronyms and abbreviations. DeepSegment or something like it seems like it might work, even though all of the pretrained models are for European languages with periods. Can sentence boundary detection be done in a general way for all languages, or does it require customization for different punctuation systems?
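For context, NLTK’s Punkt tokenizer is a well-known implementation of that paper’s approach; here’s a quick sketch of where it falls short for Chinese (assuming the pretrained punkt data is installed):

```python
import nltk

nltk.download("punkt")  # pretrained Punkt data (Kiss & Strunk models)
from nltk import sent_tokenize

# English works, including abbreviation handling:
print(sent_tokenize("Dr. Smith went home. He was tired."))
# ['Dr. Smith went home.', 'He was tired.']

# Chinese ends sentences with the ideographic full stop (。),
# which Punkt does not treat as a terminator, so nothing splits:
print(sent_tokenize("今天天气很好。我们去公园散步吧。"))
# ['今天天气很好。我们去公园散步吧。']
```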

Thanks for any help!

Stanza seems to have quite good performance for sentence segmentation in Chinese. You can have a look: https://stanfordnlp.github.io/stanza/performance.html
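A minimal usage sketch (the tokenize processor is what does the sentence splitting):

```python
import stanza

stanza.download("zh")  # one-time model download
nlp = stanza.Pipeline("zh", processors="tokenize")

doc = nlp("今天天气很好。我们去公园散步吧。")
for sentence in doc.sentences:
    print(sentence.text)
```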


Thanks, this looks really promising!

Most NLP libraries that support sentence tokenization, including NLTK, spaCy, and Stanza, come with gigantic models.

Recently, I came across pySBD, which works out of the box for 22 languages, including several non-Latin ones. It can be combined with Sentence-Splitter to expand support to about 36 languages. I guess everything else can be defaulted to English.
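For example, a minimal pySBD sketch for Chinese:

```python
import pysbd

# Chinese ("zh") is one of pySBD's 22 supported languages
seg = pysbd.Segmenter(language="zh", clean=False)
print(seg.segment("今天天气很好。我们去公园散步吧。"))
# Expect a split on the ideographic full stop:
# ['今天天气很好。', '我们去公园散步吧。']
```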

Kind regards,
Yasmin


pySBD looks like a good option for a lot of use cases.

Argos Translate currently uses Stanza, which generally works well but is a large dependency. The issue I’ve had with Stanza isn’t so much that the model itself is large, but that it requires PyTorch, which is ~1 GB.

I’ve also experimented with using a CTranslate2 seq2seq model to do sentence boundary detection. For example:

Input: <detect-sentence-boundary> Max walked down to the pond. He came back with

Output: Max walked down to the pond.

I have this working with a separate 100 MB CTranslate2 model, but to be practical I think I want to use the same CTranslate2 model that is being used for translation. If the model can recognize the <detect-sentence-boundary> token and switch tasks, it can use all of its understanding of the source language to do its own sentence boundary detection.
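As a sketch of the idea (the model path, the SentencePiece model, and the convention that the model emits only the first complete sentence are all hypothetical and assume a model trained for this task):

```python
import ctranslate2
import sentencepiece as spm

# Hypothetical artifacts: a seq2seq model trained to emit only the
# first complete sentence when the input starts with the control token.
translator = ctranslate2.Translator("sbd_model")
sp = spm.SentencePieceProcessor(model_file="sbd_model/sentencepiece.model")

def first_sentence(text):
    tokens = ["<detect-sentence-boundary>"] + sp.encode(text, out_type=str)
    results = translator.translate_batch([tokens])
    return sp.decode(results[0].hypotheses[0])

print(first_sentence("Max walked down to the pond. He came back with"))
# -> "Max walked down to the pond."
```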


IMO, this problem is better addressed with a library that does text segmentation with full Unicode support and SRX (Segmentation Rules eXchange) rules.
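For example, ICU implements Unicode (UAX #29) sentence segmentation out of the box; here is a minimal PyICU sketch of that Unicode side, with SRX rules to be layered on top (the Chinese locale is just an example):

```python
from icu import BreakIterator, Locale

def icu_sentences(text, locale="zh"):
    bi = BreakIterator.createSentenceInstance(Locale(locale))
    bi.setText(text)
    start = bi.first()
    for end in bi:  # iterating yields successive boundary offsets
        yield text[start:end]
        start = end

print(list(icu_sentences("今天天气很好。我们去公园散步吧。")))
```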
