Improving performance with data normalization

Thanks for your answer @jean.senellart; however, I don’t think I fully understand what you mean. What do you mean by a balanced corpus after annotation? Moreover, is the number I get correct? Should I expect this behavior?

Hello, with placeholders, what the training sees are sentences like:

source blabla ⦅URL⦆ source blabla => target blabla ⦅URL⦆ target blabla
  • Note that the actual source sequence (the actual URLs) is hidden from the model.
  • Now, if you have unbalanced sentences, where a ⦅URL⦆ appears in the source but not in the target (or not the same number of times), the NMT model has to learn to translate the placeholder into something. In your training, that something happened to be a number coming from such unbalanced sentence pairs. A sketch of the annotation step is just below.
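As a minimal sketch of what such annotation could look like (the regex and function here are my own illustration, not the actual hook API):

```python
import re

# Hypothetical annotator: a real preprocessing hook would need a more
# robust URL pattern and would handle other entity types too.
URL_RE = re.compile(r"https?://\S+")

def annotate(sentence):
    """Replace every URL with the ⦅URL⦆ placeholder and return the
    annotated sentence together with the hidden values."""
    hidden = URL_RE.findall(sentence)
    return URL_RE.sub("⦅URL⦆", sentence), hidden

print(annotate("see https://example.com for details"))
# ('see ⦅URL⦆ for details', ['https://example.com'])
```

The hidden values are kept aside so they can be substituted back into the translation after decoding.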

So in other words:

  • preprocess your corpus to annotate it with placeholders
  • filter out all sentence pairs where the source/target placeholders don’t map one-to-one (see the sketch just below)
  • train!
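A minimal sketch of the filtering step, assuming the corpus is already annotated (the ⦅URL⦆ token and the toy pairs are illustrative):

```python
# Toy annotated corpus; the pairs are made up for illustration.
corpus = [
    ("visit ⦅URL⦆ today", "visitez ⦅URL⦆ aujourd'hui"),  # balanced: kept
    ("visit ⦅URL⦆ today", "visitez aujourd'hui"),        # unbalanced: dropped
]

def balanced(src, tgt, tokens=("⦅URL⦆",)):
    # Keep a pair only if every placeholder type occurs the same
    # number of times on both sides.
    return all(src.count(t) == tgt.count(t) for t in tokens)

filtered = [pair for pair in corpus if balanced(*pair)]
print(len(filtered))  # 1
```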

Thanks! That’s awesome! That’s definitely the reason for my problem.

Hi Jean @jean.senellart, I am unable to use this hook approach. Could you please elaborate on it?
My problem is with sentences like the one below:
SRC : This Section was not in the Act as originally enacted, but came into force by virtue of an Amendment Act of 2009 with effect from 27.10.2009.

But the translation of the above sentence contains the date (27.10.2009) first and then the number 2009.
If I have multiple dates, for instance, how will my placeholder know which date is to be replaced?
Kindly correct me wherever I am wrong.
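One common way to handle this (a sketch only; the regex and function names below are illustrative, not from any particular toolkit) is to index the placeholders, so each date gets its own token and restoration stays unambiguous even when the translation reorders them:

```python
import re

# Matches dates like 27.10.2009; purely illustrative.
DATE_RE = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")

def annotate_dates(sentence):
    """Replace each date with an indexed placeholder (⦅DATE1⦆, ⦅DATE2⦆, ...)
    and record which original value belongs to which index."""
    mapping = {}
    def repl(match):
        key = f"⦅DATE{len(mapping) + 1}⦆"
        mapping[key] = match.group(0)
        return key
    return DATE_RE.sub(repl, sentence), mapping

def restore(translation, mapping):
    # Put each original date back in place of its indexed placeholder.
    for key, value in mapping.items():
        translation = translation.replace(key, value)
    return translation

annotated, dates = annotate_dates("enacted in 2009, in force from 27.10.2009")
print(annotated)                   # enacted in 2009, in force from ⦅DATE1⦆
print(restore(annotated, dates))   # enacted in 2009, in force from 27.10.2009
```

Note that the bare year 2009 is deliberately left untouched by this date pattern; it would be covered by a separate number placeholder if you annotate numbers as well.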