Complete Sentences / Sentences Order

SamuelLacombe · February 3, 2021, 5:33am

Hello there,

I’ve tried OpenNMT-py and so far I’m delighted with the results, but I believe I can get even better results if I improve my input data a little.

I hope I’m posting this in the right place… sorry if I’m not. It just seemed like the best fit.

Here are the 2 questions that I’m not sure about:

Does it make a difference if my sentences are not always “complete”? They are relatively long, and the extremely long ones are usually split into 2 or 3 segments.
I have some older data that is good, but not as good as the newer data… What would be the best approach to use it while still giving priority to the newest data? Would providing the sentences in a certain order be sufficient?

Best regards,
Samuel

francoishernandez · February 4, 2021, 6:02pm

Depends on your task and your final goal. Also, you might want to be more specific as ‘extremely long’ can mean different orders of magnitude.
You can use weighting to upsample your good data, or you could also try to “tell” the model that it’s not the same kind of data (look for “tagged back-translation”, there is a paper). Also any notion of “order” might not be a good idea here, as it may break your model. You can look for “curriculum learning” if you really want to check this path.
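
To make the two suggestions above concrete, here is a minimal sketch in plain Python, not OpenNMT-py code. It assumes two in-memory lists of source sentences; the `<old>` marker, the 3:1 ratio, and the helper names `tag_source` and `weighted_mix` are all hypothetical choices for illustration. Tagged back-translation marks synthetic data, so using a tag for older data is an adaptation of that idea rather than what the paper does. If I remember correctly, OpenNMT-py's YAML config also accepts a per-corpus `weight`, which would handle the mixing for you.

```python
import itertools

def tag_source(lines, tag):
    """Prepend a marker token so the model can tell the corpora apart."""
    return [f"{tag} {line}" for line in lines]

def weighted_mix(corpora, weights):
    """Endlessly interleave corpora, taking `weight` examples from each per round."""
    iterators = [itertools.cycle(corpus) for corpus in corpora]
    while True:
        for iterator, weight in zip(iterators, weights):
            for _ in range(weight):
                yield next(iterator)

# Toy data (assumption): the newer corpus is preferred, the older one is tagged.
new_data = ["new sentence 1", "new sentence 2", "new sentence 3"]
old_data = tag_source(["old sentence 1", "old sentence 2"], "<old>")

# Draw 3 new examples for every 1 old example.
mixed = list(itertools.islice(weighted_mix([new_data, old_data], [3, 1]), 8))
for line in mixed:
    print(line)
```

The mixing is by ratio rather than by position in the file, which keeps the older data present throughout training instead of imposing an order.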

SamuelLacombe · February 6, 2021, 5:24pm

Thank you for the quick reply.

Right now my sentences are at most 256 characters. If I were to combine them into real, complete sentences, they could sometimes be 10 times that length. So I’m not sure whether it is generally better to use complete sentences or sentences cut at places where a reader would naturally pause.
I will look into that!
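
For reference, cutting at the places where a reader would pause could look roughly like the sketch below. It is only an illustration under my own assumptions: the punctuation set, the helper name `split_at_pauses`, and the greedy packing strategy are hypothetical, and the 256-character limit is just the figure mentioned above. For parallel data the target side would need to be cut at matching points, which this does not attempt.

```python
import re

def split_at_pauses(text, max_chars=256):
    """Greedily pack clauses into segments of at most max_chars characters.

    Cuts are made only after punctuation marks (assumed pause points).
    A single clause longer than max_chars is kept whole rather than cut mid-clause.
    """
    # Split on whitespace that follows a punctuation mark, keeping the mark.
    pieces = re.split(r"(?<=[.;:!?,])\s+", text.strip())
    segments, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                segments.append(current)
            current = piece
    if current:
        segments.append(current)
    return segments

# Small limit so the effect is visible on a short example.
sample = "First clause, second clause; third clause. Fourth clause!"
parts = split_at_pauses(sample, max_chars=30)
print(parts)
```

Joining the segments back with single spaces reproduces the original text, so nothing is lost by cutting this way.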