I have what I fear is a real newbie question, but I’ve been struggling with this for a while, so I figured I’d try asking here. If there’s an answer to this elsewhere and I’ve missed it, please just redirect me.
How is the loss computed in a translation or summarization example, so that SGD (or whatever optimizer you’ve configured) can be applied? That is, given an input, the model generates an output based on its current weights. The loss function then compares that output to the target. But what is the actual function used for that comparison? Is it checking for exact matches against vocab elements at each position, or is it something more holistic over the whole sequence?
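To make the question concrete, here is a minimal sketch of what I imagine the "matching exact vocab elements" option would look like: a per-token cross-entropy between the model's scores over the vocab and the target token at each position. This is only my guess, not something I've confirmed in the code. The function name `token_cross_entropy`, the toy 4-word vocab, and the score values are all made up for illustration.

```python
import math

def token_cross_entropy(logits, target_ids):
    """Mean negative log-likelihood of the target tokens.

    logits: one list of raw scores per target position (one score per vocab entry).
    target_ids: the index of the correct vocab entry at each position.
    Hypothetical helper for illustration only.
    """
    total = 0.0
    for step_logits, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp gives the log of the softmax normalizer.
        m = max(step_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in step_logits))
        # -log softmax(step_logits)[target]
        total += log_z - step_logits[target]
    return total / len(target_ids)

# Toy example: vocab of 4 entries, target sequence of 3 tokens (made-up numbers).
logits = [
    [2.0, 0.5, 0.1, 0.1],
    [0.1, 3.0, 0.2, 0.1],
    [0.3, 0.2, 0.1, 2.5],
]
target_ids = [0, 1, 3]
loss = token_cross_entropy(logits, target_ids)
print(loss)
```

If this is roughly right, the loss would be differentiable with respect to the scores and would reward putting probability on the exact target token at each position, rather than judging the output sequence as a whole.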
I do see that you can set up scoring with BLEU at every checkpoint or so, but that seems to be a much coarser granularity than a loss computed on a per-batch basis.