Why is the NLL loss not normalized by the number of tokens?

I noticed that a lot of NMT implementations (including ONMT-Py) do not normalize the loss by the number of tokens (or by the batch size).
Is there some specific reason for this?
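For context, here is a minimal PyTorch sketch (not taken from the repository) contrasting the two conventions I mean: summing the NLL over all tokens versus dividing by the number of non-padding target tokens. The padding index and tensor shapes are just placeholders for illustration.

```python
# Hypothetical sketch of "summed" vs. "token-normalized" NLL, using standard PyTorch APIs.
import torch
import torch.nn.functional as F

PAD_IDX = 0  # assumed padding index, for illustration only

logits = torch.randn(8, 20, 500)          # dummy model output: (batch, seq_len, vocab)
target = torch.randint(1, 500, (8, 20))   # dummy target token ids

# Convention 1: NLL summed over all tokens in the batch (what many implementations return).
loss_sum = F.cross_entropy(
    logits.view(-1, logits.size(-1)), target.view(-1),
    ignore_index=PAD_IDX, reduction="sum")

# Convention 2: normalize by the number of non-padding target tokens.
num_tokens = target.ne(PAD_IDX).sum()
loss_per_token = loss_sum / num_tokens
```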

Line of code from the repository:

Check here: