I have what I fear is a real NOOB question but I’ve been struggling with this for a while so figured I’d try asking here. If there’s an answer to this elsewhere and I’ve missed it, please just redirect me.

How is the loss computed in a translation or summarization example, such that SGD (or whatever optimizer you've configured) can be applied? That is, given an input, the model generates an output based on its weights, and that output is then compared to the target by the loss function. But what is the actual function used for that comparison? Is it looking for exact matches of vocabulary elements? Or something more holistic?

I do see that you can set up scoring with BLEU at every checkpoint or so, but that seems to operate at a very different granularity than a per-batch loss.

The actual implementation is in /onmt/utils/loss.py. You have the option to use NLLLoss, LabelSmoothingLoss, or something else. NLLLoss computes standard categorical cross-entropy loss, while LabelSmoothingLoss (as discussed in https://arxiv.org/abs/1512.00567) measures how different the prediction is from a smoothed version of the ground-truth distribution using Kullback-Leibler divergence.
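To make the difference concrete, here is a minimal plain-Python sketch of both losses for a single predicted token (the function names are mine, and this simplified version ignores padding-token handling that the real OpenNMT implementation includes):

```python
import math

def nll_loss(log_probs, target_idx):
    # Categorical cross-entropy / negative log-likelihood:
    # the loss is simply -log p(correct token).
    return -log_probs[target_idx]

def label_smoothing_loss(log_probs, target_idx, smoothing=0.1):
    # Build the smoothed target distribution: the correct token gets
    # probability (1 - smoothing); the rest share `smoothing` uniformly.
    vocab = len(log_probs)
    uniform = smoothing / (vocab - 1)
    target = [uniform] * vocab
    target[target_idx] = 1.0 - smoothing
    # KL(target || prediction) = sum_i t_i * (log t_i - log p_i)
    return sum(t * (math.log(t) - lp)
               for t, lp in zip(target, log_probs) if t > 0)

# Toy model output over a 4-word vocabulary, as log-probabilities.
log_probs = [math.log(p) for p in [0.7, 0.1, 0.1, 0.1]]

print(nll_loss(log_probs, target_idx=0))              # -log 0.7 ≈ 0.357
print(label_smoothing_loss(log_probs, target_idx=0))
```

So the comparison is between probability distributions over the vocabulary, not an exact-match test on output strings. In practice the per-token losses are summed (or averaged) over the batch before backpropagation.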

One important thing to note is that OpenNMT's decoder is trained with teacher forcing. During training, the correct target word (along with the state vector) is fed as the next input to the decoder, rather than the decoder's own previous prediction. Hence, the decoder output and the ground truth have the same length, which makes it straightforward to compare them position by position and compute the loss (as well as accuracy).
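A small sketch of how the input/target alignment works under teacher forcing (the function and token names here are illustrative, not OpenNMT's actual API):

```python
def teacher_forcing_pairs(target_tokens, bos="<s>"):
    # With teacher forcing, the decoder input at step t is the
    # ground-truth token from step t-1, not the model's own prediction.
    decoder_inputs = [bos] + target_tokens[:-1]
    # The model must predict target_tokens[t] at step t, so inputs and
    # targets line up one-to-one and the loss is computed per position.
    return list(zip(decoder_inputs, target_tokens))

for inp, tgt in teacher_forcing_pairs(["the", "cat", "sat", "</s>"]):
    print(f"decoder input: {inp:5s} -> should predict: {tgt}")
```

Because of this one-to-one alignment, the per-position distributions can be fed directly into NLLLoss or LabelSmoothingLoss without any sequence-alignment step.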