Translation removes the new line characters

ab585 · September 28, 2021, 11:29pm

I have two sentences which are being input to my translation interface. The two sentences are separated by one or more than one new line characters. Example -

Sentence 1.
Sentence 2.

The translated sentence gets converted to Translation of Sentence 1. Translation of Sentence 2.

What would be the best way to preserve the information about the new line characters so that the translated output instead is -

Translation of Sentence 1.
Translation of Sentence 2.

dmarin · September 29, 2021, 7:05am

Hi Anurag,

This issue recalls me this other one:

It is quite likely that you lose those new line characters during normalisation or tokenisation. One way to preserve that information would be to protect the new lines with a special token in the target segments when preprocessing data for training and inference, then restoring the new lines when denormalising/detokenasing. This is what I did for non-breaking spaces, and it worked well.

Hope this helps.

Daniel

ymoslem · September 29, 2021, 4:59pm

Dear Anurag,

Daniel’s solution is perfect, and I am happy that he shared it.

The way I am achieving this with CTranslate2 is that I split the input text on new lines, split sentences in each line, and then pass one list of sentences to CTranslate2 at a time. Maybe this is not the best method, but you now got the idea.

You can find the code here, under the “advanced” directory:

Kind regards,
Yasmin

tel34 · September 30, 2021, 11:19am

Hi Anurag, I can confirm Yasmin’s solution. You can, of course, do the splitting of paragraphs into one line per sentence in a pre-processing stage. The NLTK toolkit has scripts for this. Terence

ab585 · October 11, 2021, 7:19pm

Thanks for the reply everyone. The solution you mentioned works like a charm.