Improvement of performance by data normalization


(Anderleich) #22

Thanks for your answer @jean.senellart. However, I don’t think I really understand what you mean. What do you mean by a balanced corpus after annotation? Also, is the number I get the expected output, i.e. should I expect this behavior?


(jean.senellart) #23

Hello, with placeholders, what the training sees are sentences like:

source blabla ⦅URL⦆source blabla => target blabla
  • note that the actual source sequence (the actual URLs) is hidden
  • now, if you have unbalanced sentences, where there is a ⦅URL⦆ in the source and not in the target (or not the same count), the NMT model has to learn to translate a placeholder into something. In your training, that something happened to be a number coming from such unbalanced sentences in your training data

So in other words:

  • preprocess your corpus to annotate with placeholders
  • filter out all sentence pairs where the placeholders in the source and the target do not match (same placeholders, same count)
  • train!
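
The filtering step above could be sketched roughly as follows. This is only a minimal illustration, assuming placeholders are written as ⦅NAME⦆ (as in ⦅URL⦆) and that the corpus is available as a list of (source, target) string pairs. The names `placeholder_counts`, `is_balanced` and `filter_balanced` are hypothetical helpers, not part of any toolkit.

```python
import re
from collections import Counter

# Assumed placeholder format: anything between ⦅ and ⦆, e.g. ⦅URL⦆.
PLACEHOLDER = re.compile(r"⦅[^⦆]*⦆")


def placeholder_counts(sentence):
    """Count each placeholder occurring in an annotated sentence."""
    return Counter(PLACEHOLDER.findall(sentence))


def is_balanced(source, target):
    """True if source and target hold the same placeholders, same counts."""
    return placeholder_counts(source) == placeholder_counts(target)


def filter_balanced(pairs):
    """Keep only the sentence pairs whose placeholders match."""
    return [(src, tgt) for src, tgt in pairs if is_balanced(src, tgt)]


# Hypothetical toy corpus, already annotated with placeholders.
corpus = [
    ("see ⦅URL⦆ for details", "voir ⦅URL⦆ pour les détails"),  # balanced
    ("see ⦅URL⦆ for details", "voir 42 pour les détails"),     # unbalanced
    ("no placeholder here", "pas de placeholder ici"),          # balanced
]
kept = filter_balanced(corpus)
```

Pairs without any placeholder on either side count as balanced and are kept. Only pairs where the two sides disagree are dropped, so the model never sees a placeholder that has to be translated into something else.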

(Anderleich) #24

Thanks! That’s awesome! That’s definitely the reason for my problem.