I’ve been training a punctuation restoration model using transformers. The data has the following format:
src: this is a test
tgt: NONE NONE NONE PERIOD
While evaluating different checkpoints, I found that the BLEU score oscillates between roughly 35 and 0 from one checkpoint to the next. For example: checkpoint_10000 (35 BLEU), checkpoint_20000 (0.23 BLEU), checkpoint_30000 (33.4 BLEU), and so on. When I looked at the output, I found that most of the lines are empty. This seems like strange behaviour. Any guess as to where it might come from?
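To put a number on the empty output lines, something like the following could be run on each checkpoint's predictions. This is only a minimal sketch: the function name `empty_line_fraction` and the sample predictions are made up for illustration, and it assumes the predictions are available as one string per source sentence.

```python
def empty_line_fraction(lines):
    """Return the fraction of lines that are empty or whitespace-only."""
    lines = list(lines)
    if not lines:
        return 0.0
    empty = sum(1 for line in lines if not line.strip())
    return empty / len(lines)


# Hypothetical predictions for four source sentences (not real model output).
sample_predictions = [
    "NONE NONE NONE PERIOD",
    "",
    "   ",
    "NONE PERIOD",
]

# Two of the four lines are blank, so this prints 0.5.
print(empty_line_fraction(sample_predictions))
```

Comparing this fraction across checkpoint_10000, checkpoint_20000 and checkpoint_30000 would show whether the BLEU drops line up with the checkpoints that emit mostly empty lines.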