Just out of curiosity:
Is there a specific reason why LabelSmoothingLoss is applied only during training and not during validation? Wouldn't the comparison between training loss and validation loss be more meaningful if both steps used the same loss function?
Label smoothing is a training trick. It makes little sense to apply it during validation, where you want the log-likelihood of the true target.
To make the two losses comparable, you can compute both the standard loss and the smoothed loss during training, then report the former and use the latter to compute the gradients.
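As a rough illustration of that idea, here is a minimal framework-free sketch for a single example. It assumes the common uniform variant of label smoothing, where the target distribution is (1 - epsilon) on the true class plus epsilon spread evenly over all classes; the actual LabelSmoothingLoss may distribute the smoothing mass differently (for example, by excluding padding). The function names and the value of epsilon are made up for the example.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of raw scores.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def nll_loss(logits, target):
    # Standard loss: negative log-likelihood of the true target.
    return -log_softmax(logits)[target]

def smoothed_loss(logits, target, epsilon=0.1):
    # Assumed variant: cross-entropy against a target distribution of
    # (1 - epsilon) * one_hot + epsilon / K (uniform over all K classes).
    log_probs = log_softmax(logits)
    k = len(logits)
    uniform_term = -sum(log_probs) / k
    return (1.0 - epsilon) * -log_probs[target] + epsilon * uniform_term

# Hypothetical scores for one training example with 3 classes.
logits = [2.0, 0.5, -1.0]
target = 0

report_loss = nll_loss(logits, target)                    # for logging only
train_loss = smoothed_loss(logits, target, epsilon=0.1)   # for gradients
```

In a real training loop both values would be computed on tensors from the same forward pass, with only the smoothed one passed to the backward step. The reported training loss is then on the same scale as the validation loss.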