Applying length and coverage penalties


I’m currently running some experiments with a TransformerBig model. Everything was going well until I started playing with the beam width (values from 5 to 10) and the length and coverage penalties (0.2 for both), as proposed in the literature.

The effect these parameters seem to have at inference time is that hypotheses tend to repeat tokens (usually bigrams or trigrams) at the end of the sequences.
The model I’m using was trained on a large dataset and should not be underfitted, as it trained successfully for around 35k steps. Also, I’m tokenising offline with SentencePiece (BPE) and using a batch size of 4096 tokens. I’m not adding BOS or EOS when tokenising.

My guess is that with such a batch size, and without BOS or EOS marks, the model is struggling when one or both penalties (length and/or coverage) are applied. Would that be possible? Should I add BOS and EOS marks for the beam search and these penalties to work as expected? Could the problem lie elsewhere?

Any hint on this matter would be greatly appreciated.



Coverage penalty usually does not work with Transformer models as they have multiple attention heads, while LSTM models (as used in the paper you linked) have a single attention head that can be interpreted as target-source alignment. So I suggest disabling coverage penalty.
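To make the length penalty's role concrete, here is a minimal sketch of GNMT-style length normalization as used in beam search scoring (the exact formula and the `alpha` parameter follow the GNMT paper; how a given toolkit implements it may differ):

```python
import math

def length_penalty(length, alpha=0.2):
    # GNMT-style length normalization: lp(Y) = ((5 + |Y|) / 6) ** alpha
    return ((5.0 + length) / 6.0) ** alpha

def hypothesis_score(sum_log_probs, length, alpha=0.2):
    # Beam hypotheses are ranked by their total log-probability divided
    # by the length penalty; with alpha = 0 this reduces to the raw sum,
    # which biases the beam toward shorter outputs.
    return sum_log_probs / length_penalty(length, alpha)
```

Unlike the coverage penalty, this term does not depend on attention weights, so it remains safe to use with multi-head Transformer models.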

BOS and EOS tokens are automatically added on the target side by the system.

35k steps is not a lot, especially for a TransformerBig model. You should probably continue the training to at least 100k steps. Very good models are trained for 500k steps or more.

Thanks very much, Guillaume. This is very helpful. I confirm that the coverage penalty was the cause of the erratic predictions.

Regarding the steps, the training stopped at 35k steps because it met the early-stopping condition (in my case, not improving by more than 1.0 BLEU in 3 * 2000 steps); test scores looked reasonably good, though. But now I wonder, given that I’m using “only” 10M pairs, and that I cannot afford very long training runs (100k steps would take me 3 days), whether I should prefer Transformer base models instead? I will test this case further, but any insight on the right direction will also be very helpful.

It all depends on the translation quality you expect out of the system and the training time you can afford.

But with 10M sentences and a TransformerBig model, just note that the model will keep improving for many many steps.

If you can afford it, you can try hardening the early stopping condition (1.0 BLEU is a large delta, try 0.1 for example).
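To illustrate why the delta matters, here is a hypothetical sketch of the kind of early-stopping check described above (the function name and exact patience semantics are assumptions for illustration, not any toolkit's actual implementation):

```python
def should_stop(bleu_history, min_improvement=0.1, patience=3):
    # Stop when the best of the last `patience` evaluations fails to beat
    # the best score seen before them by at least `min_improvement` BLEU.
    if len(bleu_history) <= patience:
        return False
    best_before = max(bleu_history[:-patience])
    best_recent = max(bleu_history[-patience:])
    return best_recent < best_before + min_improvement
```

With `min_improvement=1.0`, the slow but steady gains typical of a large model on 10M sentences (often well under 1 BLEU per evaluation late in training) would trigger a stop early, while `0.1` lets the run continue.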

I see. Many thanks for the hints. I will try your suggestion and see how long it takes.