Hello guys,
It is nothing new that long sentences are still an open research issue.
I would like to submit an idea. This could be handled outside Onmt, but there might be interest in including it.
When comparing results between Google Translate and Onmt, it is obvious that some multi-sentence paragraphs are handled better by GT because (I presume) GT segments paragraphs into individual sentences before translating.
Similarly, I observed that the CommonCrawl corpus, and maybe some others (I also saw this in an in-house memory), includes a lot of multi-sentence paragraphs on a single line.
Of course, the longest of these may be filtered out by max_seq_length.
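To illustrate why a length filter alone does not remove these lines, here is a rough sketch in Python. Everything in it is an assumption for illustration: the function names, the whitespace tokenization, the default of 50 tokens, and the simple punctuation regex. It is not how Onmt preprocesses data.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: a sentence boundary is ., ! or ? followed by whitespace and more text.
_INNER_BOUNDARY = re.compile(r"[.!?]\s+\S")


def is_multi_sentence(line: str) -> bool:
    """Return True if the line appears to contain more than one sentence."""
    return _INNER_BOUNDARY.search(line.strip()) is not None


def filter_by_length(
    pairs: Iterable[Tuple[str, str]], max_seq_length: int = 50
) -> List[Tuple[str, str]]:
    """Keep pairs whose source and target both have at most max_seq_length tokens.

    Tokens are counted by whitespace splitting (a simplification).
    """
    kept = []
    for src, tgt in pairs:
        if len(src.split()) > max_seq_length or len(tgt.split()) > max_seq_length:
            continue
        kept.append((src, tgt))
    return kept


def count_multi_sentence(pairs: Iterable[Tuple[str, str]]) -> int:
    """Count pairs whose source side looks like several sentences."""
    return sum(1 for src, _ in pairs if is_multi_sentence(src))
```

Running count_multi_sentence on the output of filter_by_length gives a quick idea of how many short multi-sentence lines still reach training.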
It might be good, at least at inference, to detect stop words or sentence-boundary punctuation in order to better handle long sentences or even multi-sentence input.
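As a minimal sketch of the inference-side idea, assuming the splitting is done outside Onmt: cut the input at sentence-final punctuation, translate each piece separately, then join the results. The `translate` argument is a hypothetical callable standing in for whatever calls the model, and the regex is deliberately naive (it will break on abbreviations such as "e.g. this").

```python
import re
from typing import Callable, List

# Assumption: a boundary is whitespace that directly follows ., ! or ?
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph into sentences at punctuation followed by whitespace."""
    return [s for s in _BOUNDARY.split(paragraph.strip()) if s]


def translate_paragraph(paragraph: str, translate: Callable[[str], str]) -> str:
    """Translate sentence by sentence and join the outputs with a space.

    `translate` is a hypothetical function mapping one source sentence
    to one target sentence; it is not part of any real API.
    """
    return " ".join(translate(s) for s in split_sentences(paragraph))
```

For example, translate_paragraph(text, my_model_call) would send each sentence of text to the model on its own, so no single input is longer than one sentence.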
At training (and inference), I am wondering whether there could be some benefit in using multiple attentions, one per segment, when a sentence boundary is detected. Maybe this is already part of the local attention concept.
Anyway, long sentences really are an issue.