I’m currently running some experiments with a TransformerBig model. Everything was going well until I started playing with the beam width (values from 5 to 10) and the length and coverage penalties (0.2 for both), as proposed in the literature (https://arxiv.org/abs/1609.08144).
The effect these parameters seem to have at inference time is that hypotheses tend to repeat tokens (usually bigrams or trigrams) at the end of the sequences.
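For reference, this is a minimal sketch of the two penalties as defined in eq. 14 of the cited paper (the function names are mine, not from any particular toolkit); the final beam score divides the hypothesis log-probability by the length penalty and adds the coverage penalty:

```python
import math

def length_penalty(length, alpha=0.2):
    # GNMT length penalty: lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha
    return ((5.0 + length) ** alpha) / ((5.0 + 1.0) ** alpha)

def coverage_penalty(attention, beta=0.2):
    # GNMT coverage penalty: cp(X;Y) = beta * sum_i log(min(sum_j p_ij, 1.0))
    # `attention` is a |Y| x |X| matrix of attention probabilities
    # (one row per decoded target token).
    penalty = 0.0
    num_source = len(attention[0])
    for i in range(num_source):
        coverage = sum(step[i] for step in attention)
        penalty += math.log(min(coverage, 1.0))
    return beta * penalty

def rescore(log_prob, length, attention, alpha=0.2, beta=0.2):
    # Final beam score: s(Y,X) = log P(Y|X) / lp(Y) + cp(X;Y)
    return log_prob / length_penalty(length, alpha) + coverage_penalty(attention, beta)
```

Note that dividing by lp(Y) makes longer hypotheses pay less per token, which is why I suspect the normalisation could be rewarding the repeated trailing n-grams I'm seeing.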
The model I’m using was trained on a large dataset and should not be underfitted, as it trained successfully for around 35k steps. Also, I’m tokenising offline with SentencePiece (BPE) and using a batch size of 4096 tokens. I’m not adding BOS or EOS marks when tokenising.
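Concretely, my preprocessing looks roughly like the sketch below (the model file path is just a placeholder); the second call shows the BOS/EOS variant I'm considering, assuming the options available in the SentencePiece Python wrapper:

```python
import sentencepiece as spm

# Hypothetical path to the trained BPE model.
sp = spm.SentencePieceProcessor(model_file="bpe.model")

line = "An example source sentence."

# What I'm doing now: no BOS/EOS markers.
pieces = sp.encode(line, out_type=str)

# The variant I'm asking about: wrap the sequence in BOS/EOS.
pieces_marked = sp.encode(line, out_type=str, add_bos=True, add_eos=True)
```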
My guess is that, for such a batch size and without BOS or EOS marks, the model struggles when one or both penalties (length and/or coverage) are applied. Could that be the case? Should I add BOS and EOS marks for the beam search and these penalties to work as expected? Or could the problem lie somewhere else entirely?
Any hint on this matter would be greatly appreciated.