Is it okay to translate a long paragraph with many sentences without doing anything else?

I used PyOpenNMT, and when I tried to translate a longer paragraph, it only produced a truncated prediction with much of the content lost, so I am not sure whether it is okay to translate a long paragraph with many sentences without doing anything else.

I noticed that only an end-of-sentence token is added at the end of the training data.


If you trained your model on single sentences, you need to do that for translation also.

Oh, thank you for your timely response. This resolves part of my doubt, but ideally we could feed in input of any length and get a good output, so my question is how to translate a long paragraph. I believe this is not just a training problem; I am not sure whether it is possible to just add some BOS and EOS tokens at certain points in the paragraph, together with context vectors, to produce a long translation.

Just split it into sentences.

Splitting is not that easy though: should I split on a comma or a period? And I think this could lose a lot of information.

There are several sentence splitters out there. For example, NLTK has a PunktSentenceTokenizer.
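As a minimal sketch (standard library only; NLTK's trained PunktSentenceTokenizer is more robust for real text because it handles abbreviations), splitting a paragraph on sentence-final punctuation before translating each piece might look like:

```python
import re

def naive_split(paragraph: str) -> list[str]:
    """Very naive sentence splitter: break after '.', '!' or '?'
    followed by whitespace. It fails on abbreviations like "Dr.",
    which is why a trained tokenizer such as NLTK's
    PunktSentenceTokenizer is preferable for real data."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

sentences = naive_split(
    "The model was trained on single sentences. "
    "So split the paragraph first! Then translate each piece."
)
print(sentences)
```

Each element of `sentences` can then be translated independently and the outputs rejoined in order.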

The general rule is that you should process your inference data the same way as the training data.


Nice answers. So do you think my models produced limited output because of the limited length of my model input? The output was always capped at about 20 Chinese characters, regardless of the length of the single input sentence.

Hard to tell. It depends on your training data, how well your model is trained, etc.

The main point of this thread is that the model will behave incorrectly when feeding a paragraph instead of a single sentence (assuming it was trained on single sentences).

I agree. Thanks.