Does opennmt-py use loss masking?

jhandsel · November 15, 2020, 1:01am

Apologies if this is a silly question.

Does opennmt-py ignore the pad tokens when calculating loss? I couldn’t find anything on this in the documentation.

I’m working on a character-level seq2seq model, and the training-set accuracy reaches ~90% after a few hundred steps. However, the translation quality is very poor. I tried to replicate this behavior with my own TensorFlow code, and the only way I could do so was by turning off loss masking.

I’m tokenizing on-the-fly, if that makes any difference:

save_model: model/char
save_data: model/objects

# Specific arguments for pyonmttok
# Char level, space is separate token
src_onmttok_kwargs: "{'mode': 'char', 'spacer_annotate': True, 'spacer_new': True}"
tgt_onmttok_kwargs: "{'mode': 'char', 'spacer_annotate': True, 'spacer_new': True}"
src_vocab: vocab_char/src_vocab
tgt_vocab: vocab_char/tgt_vocab

# Corpus opts:
data:
    shard0:
        path_src: src_train
        path_tgt: tgt_train
        transforms: [onmt_tokenize,filtertoolong]
        weight: 1

    valid:
        path_src: src_val
        path_tgt: tgt_val
        transforms: [onmt_tokenize,filtertoolong]

#### Filter
src_seq_length: 200
tgt_seq_length: 128

francoishernandez · November 16, 2020, 9:42am

Yes, the padding token index is passed to ignore_index in the criterion:

https://github.com/OpenNMT/OpenNMT-py/blob/7fd883baab1a5d99126a47c6a4b42e787921a1b3/onmt/utils/loss.py#L45


        len(tgt_field.vocab), opt.copy_attn_force,
        unk_index=unk_idx, ignore_index=padding_idx
    )
elif opt.label_smoothing > 0 and train:
    criterion = LabelSmoothingLoss(
        opt.label_smoothing, len(tgt_field.vocab), ignore_index=padding_idx
    )
elif isinstance(model.generator[-1], LogSparsemax):
    criterion = SparsemaxLoss(ignore_index=padding_idx, reduction='sum')
else:
    criterion = nn.NLLLoss(ignore_index=padding_idx, reduction='sum')

# if the loss function operates on vectors of raw logits instead of
# probabilities, only the first part of the generator needs to be
# passed to the NMTLossCompute. At the moment, the only supported
# loss function of this kind is the sparsemax loss.
use_raw_logits = isinstance(criterion, SparsemaxLoss)
loss_gen = model.generator[0] if use_raw_logits else model.generator
if opt.copy_attn:
    compute = onmt.modules.CopyGeneratorLossCompute(
        criterion, loss_gen, tgt_field.vocab, opt.copy_loss_by_seqlength,
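
For anyone who wants to see what ignore_index does in the default branch above, here is a minimal standalone sketch using plain PyTorch. It is not OpenNMT-py code: the padding index of 1, the tiny vocabulary, and the scores are all made up for illustration. It only shows that nn.NLLLoss with ignore_index and reduction='sum' gives the same value as summing the loss over the non-pad positions by hand.

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # hypothetical padding index, chosen only for this sketch

# Log-probabilities for 4 target positions over a 3-token vocabulary.
log_probs = torch.log_softmax(torch.tensor([
    [2.0, 0.5, 0.1],
    [0.2, 0.1, 3.0],
    [1.0, 1.0, 1.0],
    [0.3, 2.0, 0.4],
]), dim=-1)

# The last two positions are padding.
target = torch.tensor([0, 2, PAD_IDX, PAD_IDX])

# Same constructor arguments as the final `else` branch in the excerpt.
masked = nn.NLLLoss(ignore_index=PAD_IDX, reduction='sum')
# Without ignore_index, pad positions contribute to the loss as well.
unmasked = nn.NLLLoss(reduction='sum')

masked_loss = masked(log_probs, target)
unmasked_loss = unmasked(log_probs, target)

# Manual sum over the two real (non-pad) positions only.
manual_loss = -(log_probs[0, 0] + log_probs[1, 2])

print(masked_loss.item(), unmasked_loss.item(), manual_loss.item())
```

The masked and manual values should match, while the unmasked loss is larger because it also counts the two pad positions.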

Is everything else comparable between your tf run and OpenNMT-py one? Model size, batch size, batching method, …?

Also, to check your tokenization is properly applied, you can set the n_sample opt in your config to dump n_sample examples with the transforms applied.

jhandsel · November 19, 2020, 9:27pm

My bad, I didn’t realize that dropout isn’t applied when using a single-layer LSTM. The strangely high accuracy was due to overfitting.
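
In case it helps someone else, the single-layer behavior can be checked with plain PyTorch. This is a rough sketch that assumes the recurrent layer is a torch.nn.LSTM, whose dropout argument is applied to the outputs of every layer except the last, so it has no effect when num_layers is 1. The sizes and the helper name outputs_differ are made up for the example.

```python
import warnings

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(5, 3, 8)  # (seq_len, batch, input_size)


def outputs_differ(num_layers):
    """Return True if two training-mode passes over x give different outputs."""
    with warnings.catch_warnings():
        # PyTorch warns when dropout > 0 is combined with num_layers == 1.
        warnings.simplefilter("ignore")
        lstm = nn.LSTM(input_size=8, hidden_size=16,
                       num_layers=num_layers, dropout=0.5)
    lstm.train()  # dropout is only active in training mode
    out1, _ = lstm(x)
    out2, _ = lstm(x)
    # If dropout is active, a fresh random mask is drawn on each pass.
    return not torch.equal(out1, out2)


single = outputs_differ(1)   # no inter-layer connection, so no dropout
stacked = outputs_differ(2)  # dropout sits between layer 1 and layer 2
print(single, stacked)
```

With one layer the two passes are identical, and with two layers they differ, which is consistent with dropout being a no-op in the single-layer case.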