Train/Infer on paragraphs


I’m planning on training my model with paragraphs structured like this:

Sentence 1. Sentence 2. Sentence 3. Sentence 4.
Sentence 2. Sentence 3. Sentence 4. Sentence 5.
Sentence 3. Sentence 4. Sentence 5. Sentence 6.

where each line corresponds to one line of my training files.

My plan is that at inference, I will pass “Sentence 1. Sentence 2. Sentence 3.” as a known input and ask CTranslate2 to complete the translation, so it has more context to translate Sentence 4. Then, to translate Sentence 5, I will pass “Sentence 2. Sentence 3. Sentence 4.” as the known input, and so on.
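
A minimal sketch of this windowing in Python (the function names here are mine, purely illustrative). CTranslate2 exposes a `target_prefix` option on `Translator.translate_batch` that should serve as the “known input” on the target side:

```python
def sliding_windows(sentences, size=4):
    """Overlapping windows of `size` consecutive sentences, one per training line."""
    return [" ".join(sentences[i:i + size])
            for i in range(len(sentences) - size + 1)]

def inference_pairs(src_sentences, known_translations, size=4):
    """For each window: the full source window, plus the translations of its
    first `size - 1` sentences as the known target prefix."""
    pairs = []
    for i in range(len(src_sentences) - size + 1):
        source = " ".join(src_sentences[i:i + size])
        prefix = " ".join(known_translations[i:i + size - 1])
        pairs.append((source, prefix))
    return pairs
```

At inference time, each `prefix` would be tokenized and passed as `target_prefix` so the decoder only has to generate the translation of the last sentence in the window.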

Is there any paper related to this? Or has anyone tried this?


There are some papers related to this kind of approach in document-level machine translation.
For instance, Neural Machine Translation with Extended Context - ACL Anthology


Thank you for the link to the paper.

The paper uses slightly different approaches. I believe the approach I want to try will yield better results, though it is less efficient; that is not an issue in my case. It will also give me better results when I have real user input for the previous sentences. When generating the “draft” of a document, this method more or less guarantees continuity in the context, up to a certain degree.

All this should hold if CTranslate2 considers the translation provided as input when we ask it to complete the remainder of the translation (which I’m not sure it does).

This might be useful:

Contextual Handling in Neural Machine Translation: Look Behind, Ahead and on Both Sides


Hello, Samuel!

In addition to the good suggestions by the colleagues, I would like to refer to these two papers. The first paper compares three approaches to Context-aware Neural Machine Translation. The second paper elaborates on the range of context selection.

Kind regards,


After reading all the papers,

I decided to give it a go. Here is how I did it:

  1. I aligned my data into units I call captions (a caption may or may not be a full sentence, depending on its size).
  2. I created an ID to keep track of captions that are consecutive in their original text.
  3. I filtered out any caption that didn’t fit my filtering rules.
  4. I grouped the captions: all captions that are still consecutive relative to the original text are grouped together (determined from the consecutive IDs).
  5. Calculated the number of captions per group and shuffled the groups.
  6. Determined the number of captions I wanted for training/testing/validation.
  7. Split the groups into 3 categories so that each holds roughly the target number of captions for training/testing/validation.
  8. Generated the concatenations so that each line holds 3 captions:
    Caption1 + Caption2 + Caption3
    Caption2 + Caption3 + Caption4
    Caption3 + Caption4 + Caption5
  9. Kept the individual captions in the batch too.
  10. Applied augmentation techniques on top of these.
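
Steps 4 and 8 above can be sketched like this (hypothetical helper names; the real pipeline obviously includes the filtering and splitting steps as well):

```python
def group_consecutive(captions):
    """Group (id, text) pairs whose IDs are consecutive in the original text."""
    groups, current, prev_id = [], [], None
    for cid, text in captions:
        if prev_id is not None and cid != prev_id + 1:
            groups.append(current)  # gap in the IDs: start a new group
            current = []
        current.append(text)
        prev_id = cid
    if current:
        groups.append(current)
    return groups

def merged_lines(group, n=3):
    """All runs of n consecutive captions within a group, joined into one line."""
    return [" ".join(group[i:i + n]) for i in range(len(group) - n + 1)]
```

Groups shorter than `n` simply produce no merged lines, which is why keeping the individual captions in the batch (step 9) still matters.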

I generated different formats to facilitate the testing afterwards. I have one file with only the 3 merged captions and one file with just the basic captions.

So far, this is a graph I obtained with my previous model, which was trained on the same data except without the 3 merged captions:

Note for graphs:

  • the % at the top represents the share of test cases that fall in the section before the line
  • each dot represents a translated segment.

Based on the same data, but not the same training/testing/validation files.

These are the results with 3 captions, but tested on single captions (not merged), so the BLEU scores should be comparable.

So far, comparing both graphs, we can notice about +3 BLEU on average. The data is comparable.

And finally, this is the test set with the 3 captions merged.

It can’t conceptually be compared to the other two, but there is something interesting in there: there are very few dots at the bottom left, which is where the model is confident yet there is a big gap with the expected output. I have found that this zone corresponds either to a lack of context or to a wrong gold reference.
I believe the BLEU score also has a bias: the more words you have, the less accurate it becomes. But I see this pattern in my graphs for WER/hLEPOR/METEOR as well.

So it seems that the model trained with more context still improves on shorter sentences too.

Here is the summary of my first model, without the 3 merged captions:

Model with 3 merged captions, but tested only on single captions:

Model with 3 captions, tested on the 3-caption set:

The next step is to use my 3-caption model with the 3-caption test batch, but translate “knowing” the first 2 captions and see how that improves the BLEU score. I’m having some difficulty making it work in CTranslate2; I will update this post when I get the graphs.

Best regards,


Thanks, Samuel, for sharing your experiments in such detail!

I am not sure if I missed this, but do you notice any improvement in BLEU when you translate one caption, and compare the score with and without context?

Another question, my understanding is that you used simple concatenation, is this true?

I would like to refer to this paper as well:

Kind regards,


Hello Yasmin,

This is the next test I wanted to do (mentioned at the end of my text) :wink:

But I haven’t been able to make CTranslate2 work when providing the beginning of the translation.

As for your other question, whether it’s a simple concatenation: I put a special tag between captions. This makes it easy to retrieve each one of them. Also, since the model is used in a CAT tool, I want to be able to split whatever I merged.
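
A sketch of the tag-based merge/split, assuming a separator token (here `<sep>`, a made-up tag) that the tokenizer is configured to keep intact:

```python
SEP = "<sep>"  # hypothetical separator tag; must survive tokenization as one token

def merge_captions(captions):
    """Join captions into one segment with an explicit separator tag."""
    return f" {SEP} ".join(captions)

def split_captions(segment):
    """Recover the individual captions from a merged segment."""
    return [part.strip() for part in segment.split(SEP)]
```

The round trip `split_captions(merge_captions(caps)) == caps` is what lets the CAT tool hand back individual captions after a merged translation.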

Now, I’m going to test with different quantities of concatenated captions to see how the model improves, and find the sweet spot.

But so far, even without using the feature where I provide the beginning of the translation, I already see a gain in the model. I would argue that this technique could be considered data augmentation.

Best regards,


So, after some tests…

(please refer to my previous post for how I structured my dataset)

but for quick recap:

  • the validation/testing/training datasets use mutually exclusive captions.

I generated several training datasets in which I merged the “captions” in a logical way.

The first dataset had only 1 caption per segment.
The second one had everything from the first dataset plus all possible logical combinations of 2 captions.
And so on, until I reached 6 captions merged together.

Based on the last dataset (the one with 6 captions), I used the 6-caption test set and translated with the first, second, third, … fifth captions known, independently, and compared the BLEU scores. I also provided the “original captions” for the known part, so the known part was not machine translated.

I don’t believe there is a way to make it more apples-to-apples than that.

The results (avg BLEU score):
no known captions: 40.648254
1 known caption: 41.972118
2 known captions: 41.691708
3 known captions: 41.868456
4 known captions: 41.749295
5 known captions: 41.996849

I noticed that the model is more prone to repeating the same string (which usually has no impact on the BLEU score, but does impact the WER score). This is clearly because the model was trained with segments of different lengths. I’m planning to fix that with the </s> in the target and, if that’s not enough, to create a custom tag at the beginning of the source and target that represents the “length” of the segment. I’m hoping the model learns the link between the tag and the length, and then learns better when to stop.
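
The length-tag idea could be sketched like this (the bucket limits and tag names are made up for illustration; the tags would have to be protected from tokenization like any other control token):

```python
def length_tag(text, buckets=((10, "<len_s>"), (25, "<len_m>"))):
    """Prepend a coarse length tag based on word count.

    Anything longer than the last bucket limit falls into <len_l>.
    """
    n = len(text.split())
    for limit, tag in buckets:
        if n <= limit:
            return f"{tag} {text}"
    return f"<len_l> {text}"
```

The same tag would be prepended to both the source and the target line at training time, so that at inference the source tag alone hints at how long the output should be.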

There are so many scenarios to try, but so far I’m satisfied with what I see, and it should be enough for my needs.


Thanks, Samuel, for the comprehensive results and comparison!

I think this happens when the training data is small. Have you tried the translation option block_ngram_repeat in OpenNMT-py or repetition_penalty in CTranslate2?

Kind regards,



I haven’t, but will certainly have a look at it.

Thank you!


I just figured out that my tests were not really valid, as I had forgotten to change the maximum length of the source and target in OpenNMT-tf. So it was limited to 250 characters; I should have increased it to 1700.

I’m not sure what the “impact” of having such long strings is. I’ve been searching the internet, but can’t find any reason why we wouldn’t want very long strings. Google AutoML recommends a maximum of 50 words.

I’m going to partially redo the testing with a higher limit on the number of characters.

Best regards,


Having long strings makes training/inference take very long (not to mention the chances of OOM issues). Transformers scale as O(n²d), where d is the feature dimension and n is the sequence length, so as the sequence length (number of tokens) increases, your computation takes longer. (Transformers improved over the previous recurrent NNs, which scale as O(nd²), which is why they can generally train faster.)
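
A toy back-of-the-envelope comparison of the two scaling behaviours (operation counts only, ignoring constants):

```python
def attention_cost(n, d):
    """Self-attention: quadratic in sequence length n, linear in dimension d."""
    return n * n * d

def rnn_cost(n, d):
    """Recurrent step: linear in sequence length n, quadratic in dimension d."""
    return n * d * d

# Doubling the sequence length quadruples the attention cost,
# and once n exceeds d, attention becomes the more expensive of the two.
```

This is why concatenating 6 captions into one segment costs far more than 6 separate translations, even though the token count is the same.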

You can also check out this on the marian-nmt Github

The problem here depends on how you have implemented your document-level system. With current Marian I would say there are two ways to achieve that out-of-the-box with little to no need to code anything:

The effect you wish for should appear automatically if you are lucky.


Hi Samuel!

These are good experiments! Keep us informed.

Length is calculated in tokens, not characters. As you said, you have to change maximum_features_length and maximum_labels_length, as well as maximum_decoding_length.
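
For reference, these options live in different sections of the OpenNMT-TF YAML configuration; a sketch with illustrative values (section placement follows the standard config layout, but double-check against the current docs):

```yaml
# Illustrative OpenNMT-TF config fragment; the values are examples, not recommendations.
train:
  maximum_features_length: 400   # source length limit, in tokens
  maximum_labels_length: 400     # target length limit, in tokens
params:
  maximum_decoding_length: 400   # cap on generated tokens at inference
```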

Kind regards,

Thank you, Yasmin and James,

Out of curiosity, what would be your recommendation for the validation set?

Currently, I’m using only the 6-sentence combinations, but my training set contains all n-grams, plus combinations of 2 to 6 sentences (when they were consecutive), plus data augmentation (no punctuation, etc.).

I wasn’t sure whether I should have kept the combinations of 2 to 6 sentences in the validation set. I did not, because the impact on the BLEU score generated at each checkpoint is too strong. But I’m not sure what impact the composition of the validation set has on training.

Best regards,

Hi Samuel!

If your purpose is to assess the effect of translation on paragraphs (i.e. context-aware / document-level NMT), then you definitely need to have combined sentences in both your validation and test datasets. I would prepare a mix (e.g. 50/50) of sentences and paragraphs because I would want the system to work well on both.

All the best,
