Paragraph vs sentence segmentation

Hello guys,

It's nothing new that long sentences are still an open research issue.
I would like to submit an idea. This could be handled outside Onmt, but there might be interest in including it.

When comparing some results between Google Translate and Onmt, it is obvious that some multi-sentence paragraphs are better handled, because (I presume) GT segments the paragraphs into multiple sentences.
Similarly, I observed that the CommonCrawl corpus, and maybe some others (at least I also observed this in an in-house memory), includes a lot of multi-sentence paragraphs on a single line.
Of course those may be filtered out by the max_seq_length.
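To make the filtering point concrete, here is a minimal sketch of how a length-based filter over parallel data might look. The function name `filter_long_pairs` and the default budget are hypothetical, not part of OpenNMT:

```python
def filter_long_pairs(src_lines, tgt_lines, max_len=50):
    # Hypothetical sketch: drop parallel line pairs whose source or
    # target side exceeds a whitespace-token budget, mimicking what a
    # max_seq_length filter does. Multi-sentence paragraphs on one line
    # tend to be the pairs removed this way.
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept.append((src, tgt))
    return kept
```

Note that such a filter simply discards long paragraphs rather than segmenting them, which is exactly the lost data the segmentation idea above would try to recover.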

It might be good, at least at inference, to detect stop words or sentence-boundary punctuation to try to better handle long sentences, or even multi-sentence input.

At training (and inference) time, I am wondering whether there could be some benefit to handling multiple attention spans when a sentence boundary is detected. Maybe this is part of the local attention concept.

Anyway, long sentences are really an issue :slight_smile:


Did you take a look at the NLTK sentence tokenizer?
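With NLTK, one would call `nltk.sent_tokenize(paragraph)` after downloading the `punkt` model. For illustration only, here is a rough dependency-free sketch of the same punctuation-based idea; the regex heuristic is my own assumption and is deliberately naive (it will mis-split abbreviations like "Mr. Smith"):

```python
import re

def naive_split(paragraph):
    # Rough heuristic sketch of sentence segmentation: split after
    # ., ! or ? when followed by whitespace and an uppercase letter.
    # A trained tokenizer like NLTK's Punkt handles the hard cases
    # (abbreviations, decimals, quotes) that this regex cannot.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', paragraph.strip())
    return [p for p in parts if p]
```

Even this toy version shows why the parallel-data case is hard: nothing guarantees the source and target sides split into the same number of sentences.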

Building a good sentence splitter is not an easy task, especially if you need to apply it to parallel sentences (for training data).

@vince62s - Unless you know the sentences in parallel paragraphs are equal in number and sequentially parallel, you can quickly run into problems when trying to segment data aligned at the paragraph level.

I’d be keen to see research into handling much longer sequences, though. That way you could even train on transcreated data (using words/sub-word units), and character-level segmentation of sentences might be more robust…