This is some information about how I’ve done sentence boundary detection for Argos Translate using CTranslate2 that I wanted to share. If you have suggestions for improvements please let me know.
The way Argos Translate currently translates strings of text is that it first splits the text into sentences and then translates each sentence independently. This allows us to use small and efficient neural networks that can only handle 150 characters of input. However, since each sentence is translated independently, there’s no mechanism for context from one sentence to influence the translation of nearby sentences. In most cases this is fine, and translating each sentence independently leads to understandable results. This approach also requires a way to split a string of text into discrete sentences.
Argos Translate currently uses Stanza to detect sentence boundaries and split text into sentences. Stanza has worked very well for us, and supports a large number of languages, but it’s a bit clunky and slow. Stanza uses neural networks to detect sentence boundaries in text, but also does a lot of other things like identifying parts of speech.
There are several other Python libraries available to detect sentence boundaries. Many of them use a set of rules to determine which periods in a text mark a sentence boundary and which belong to an abbreviation like “Dr.” or “P.J.”. This works well in a lot of cases, but it can miss some nuance and doesn’t work well for non-European languages that don’t use periods.
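To give a feel for the rule-based approach, here is a minimal Python sketch. The abbreviation list and the `rule_based_split` function are made up for this illustration and are not taken from any particular library; real libraries use much larger rule sets.

```python
import re

# Hypothetical, tiny abbreviation list used only for this illustration.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "P.J."}


def rule_based_split(text):
    """Split text at ., ! or ? unless the word ending there is a known abbreviation."""
    sentences, start = [], 0
    # Candidate boundaries: terminal punctuation followed by whitespace or end of text.
    for match in re.finditer(r"[.!?](?=\s+|$)", text):
        end = match.end()
        last_word = text[start:end].split()[-1]
        if last_word in ABBREVIATIONS:
            continue  # this period belongs to an abbreviation, not a boundary
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without terminal punctuation
    return sentences


print(rule_based_split("Dr. Smith arrived late. The meeting had started."))
```

The weakness described above is visible here: any abbreviation missing from the list causes a wrong split, and text that doesn't use these punctuation marks is never split at all.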
My plan going forward is to use CTranslate2, which is the Transformer inference engine for Argos Translate, to detect sentence boundaries. I have this implemented for the v2 development version of Argos Translate but am not currently using it on the master branch.
The way this works is that instead of translating between languages like “en”->“de” with a Transformer neural network, I translate from “string of text”->“the first sentence of the original string of text”. Then I use a string similarity metric to match the output text from the Transformer model to a substring in the original text. For example:
<detect-sentence-boundary> This is the first sentence. This is more text that i
This is the first sentence. <sentence-boundary>
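The matching step could look something like the following sketch. It assumes the first sentence is always a prefix of the input and uses `difflib.SequenceMatcher` from the Python standard library as the similarity metric; the metric, the `find_boundary` name, and the `slack` parameter are my choices for illustration, not necessarily what Argos Translate does.

```python
from difflib import SequenceMatcher

SENTENCE_BOUNDARY_TOKEN = "<sentence-boundary>"


def find_boundary(text, model_output, slack=10):
    """Return the index in `text` where the first sentence most likely ends.

    Compares prefixes of `text` against the model output and keeps the
    prefix with the highest similarity ratio. `slack` (hypothetical) is how
    many characters longer or shorter than the output a prefix may be.
    """
    candidate = model_output.replace(SENTENCE_BOUNDARY_TOKEN, "").strip()
    if not candidate:
        return len(text)  # nothing usable from the model: treat all text as one sentence
    best_end, best_score = len(text), -1.0
    lo = max(1, len(candidate) - slack)
    hi = min(len(text), len(candidate) + slack)
    for end in range(lo, hi + 1):
        score = SequenceMatcher(None, text[:end], candidate, autojunk=False).ratio()
        if score > best_score:
            best_end, best_score = end, score
    return best_end


text = "This is the first sentence. This is more text that i"
end = find_boundary(text, "This is the first sentence. <sentence-boundary>")
print(repr(text[:end]))
```

Matching against the original text, rather than trusting the model output directly, means a small mistake in the generated sentence doesn't leak into the text that later gets translated.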
In my experience this works pretty well, though for now the Stanza library still works better in most cases. The CTranslate2 approach also allows me to use one Transformer inference engine for both translation and sentence boundary detection. This is beneficial because it reduces the total dependencies required by Argos Translate (Stanza requires PyTorch, which is ~1GB to install).
One issue with using the CTranslate2 seq2seq Transformer model for sentence boundary detection is that it can be difficult to find high-quality training data since this is a niche task. To create data, I joined unrelated sentences from Opus datasets to create fake sentence boundaries to train the Transformer model on. For example:
<detect-sentence-boundary> This is a sentence from the Europarl dataset. This is an unrelated sentence in the same dataset
This is a sentence from the Europarl dataset. <sentence-boundary>
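Generating pairs like this can be sketched in a few lines of Python. The function names are hypothetical, and truncating the joined text to 150 characters is my assumption based on the input limit mentioned earlier; the real data pipeline may differ.

```python
import random

DETECT_TOKEN = "<detect-sentence-boundary>"
BOUNDARY_TOKEN = "<sentence-boundary>"
MAX_INPUT_CHARS = 150  # assumed to mirror the model's input limit


def make_example(first, second, max_chars=MAX_INPUT_CHARS):
    """Build one (source, target) training pair from two unrelated sentences."""
    joined = f"{first} {second}"[:max_chars]  # cut off mid-sentence, like real input
    source = f"{DETECT_TOKEN} {joined}"
    target = f"{first} {BOUNDARY_TOKEN}"
    return source, target


def make_dataset(sentences, n, seed=0):
    """Draw n random pairs of distinct sentences and turn each into an example."""
    rng = random.Random(seed)
    return [make_example(*rng.sample(sentences, 2)) for _ in range(n)]


source, target = make_example(
    "This is a sentence from the Europarl dataset.",
    "This is an unrelated sentence in the same dataset",
)
print(source)
print(target)
```

A real pipeline would probably also want to skip first sentences that are longer than the input limit, since the target would then contain text the model never sees.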
Going forward, I also want to experiment with using rule-based sentence boundary detection systems to generate synthetic data for neural-network-based sentence boundary detection. This could be done by taking unstructured text data, splitting it into sentences with a rule-based system, and then using the split sentences as training data for the neural-network-based system.
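A rough sketch of that idea is below. I haven't built this yet, so everything here is hypothetical: `naive_split` is a stand-in for whichever rule-based library ends up being used, and the 150 character truncation is assumed from the model's input limit.

```python
import re

DETECT_TOKEN = "<detect-sentence-boundary>"
BOUNDARY_TOKEN = "<sentence-boundary>"


def naive_split(text):
    """Stand-in rule-based splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def examples_from_text(text, split=naive_split, max_chars=150):
    """Turn unstructured text into (source, target) pairs using a rule-based splitter.

    Each sentence that has a following sentence becomes one example: the
    source is the text starting at that sentence (truncated), and the
    target is the sentence itself plus the boundary token.
    """
    sentences = split(text)
    examples = []
    for i in range(len(sentences) - 1):
        remaining = " ".join(sentences[i:])[:max_chars]
        examples.append((f"{DETECT_TOKEN} {remaining}", f"{sentences[i]} {BOUNDARY_TOKEN}"))
    return examples


for source, target in examples_from_text("The train was late. Nobody minded. We talked instead."):
    print(source)
    print(target)
```

Unlike joining unrelated sentences, the text after each boundary here is the real continuation, so the model would train on more natural context. The tradeoff is that any mistakes made by the rule-based splitter end up in the training data.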