Train/Infer on paragraphs


I’m planning to train my model on paragraphs structured like this:

Sentence 1. Sentence 2. Sentence 3. Sentence 4.
Sentence 2. Sentence 3. Sentence 4. Sentence 5.
Sentence 3. Sentence 4. Sentence 5. Sentence 6.

where each line above corresponds to one line of my training files.
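As a rough illustration of that layout, here is a minimal sketch that builds the overlapping lines from a list of sentences. The function name `sliding_windows` and the window size of 4 are assumptions made for this example only; in a parallel corpus the same windowing would presumably be applied to both the source and the target file so the lines stay aligned.

```python
from typing import List


def sliding_windows(sentences: List[str], size: int = 4) -> List[str]:
    """Join every run of `size` consecutive sentences into one training line."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # A document shorter than `size` yields no lines at all.
    return [
        " ".join(sentences[i:i + size])
        for i in range(len(sentences) - size + 1)
    ]


sentences = [f"Sentence {n}." for n in range(1, 7)]
for line in sliding_windows(sentences):
    print(line)
```

With six sentences and a window of four, this prints the three lines shown above.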

My plan is that at inference, I will pass “Sentence 1. Sentence 2. Sentence 3.” as known input and ask CTranslate2 to complete the translation, so it has more context to translate Sentence 4. Then, to translate Sentence 5, I will pass “Sentence 2. Sentence 3. Sentence 4.” as the known input, and so on.
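The inference loop described here could be sketched roughly as follows. Everything in it is an assumption for illustration: `translate_document` and the `complete` callback are hypothetical names, and `complete(source_window, target_prefix)` is assumed to return only the newly generated continuation, without the prefix. A stand-in callback replaces the real model so the sketch runs on its own.

```python
from typing import Callable, List


def translate_document(
    source_sentences: List[str],
    complete: Callable[[str, str], str],
    context: int = 3,
) -> List[str]:
    """Translate one sentence at a time, passing earlier translations as a prefix."""
    translations: List[str] = []
    for i in range(len(source_sentences)):
        start = max(0, i - context)
        # Source side: up to `context` previous sentences plus the current one.
        source_window = " ".join(source_sentences[start:i + 1])
        # Target side: the already known translations of the previous sentences.
        target_prefix = " ".join(translations[start:i])
        translations.append(complete(source_window, target_prefix).strip())
    return translations


def fake_complete(source_window: str, target_prefix: str) -> str:
    """Stand-in for a real model: upper-cases the last source sentence."""
    return source_window.split(". ")[-1].upper()


print(translate_document(["a.", "b.", "c.", "d.", "e."], fake_complete))
```

With CTranslate2, the callback would presumably wrap `Translator.translate_batch` and pass the known translation through its `target_prefix` argument, which expects tokenized input; tokenization and stripping the prefix from the returned hypothesis are left out here. Note also that the first few sentences get a shorter context than the four-sentence training lines, so the training data may need shorter windows as well.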

Is there any paper related to this? Or has anyone tried this?


There are some papers related to this kind of approach in document-level machine translation.
For instance, Neural Machine Translation with Extended Context - ACL Anthology


Thank you for the link to the paper.

The paper uses a slightly different approach. I believe the approach I want to try will yield better results but be less efficient, which is not an issue in my case. It will also give me better results when I have the real user's input for the previous sentences. When generating the “draft” of a document, this method more or less guarantees continuity of context, up to a certain degree.

All this should hold if CTranslate2 takes the translation provided as input into account when we ask it to complete the remainder of the translation (which I’m not sure it does).

This might be useful:

Contextual Handling in Neural Machine Translation: Look Behind, Ahead and on Both Sides


Hello, Samuel!

In addition to the good suggestions by the colleagues, I would like to refer to these two papers. The first paper compares three approaches to Context-aware Neural Machine Translation. The second paper elaborates on the range of context selection.

Kind regards,