Does OpenNMT-py use loss masking?

Apologies if this is a silly question.

Does OpenNMT-py ignore pad tokens when calculating the loss? I couldn’t find anything about this in the documentation.

I’m working on a character-level seq2seq model, and the training-set accuracy reaches ~90% after a few hundred steps, yet the translations are very poor. I tried to replicate this behavior with my own TensorFlow code, and the only way I could do so was by turning off loss masking.

I’m tokenizing on the fly, if that makes any difference:

save_model: model/char
save_data: model/objects

# Specific arguments for pyonmttok
# Char level, space is separate token
src_onmttok_kwargs: "{'mode': 'char', 'spacer_annotate': True, 'spacer_new': True}"
tgt_onmttok_kwargs: "{'mode': 'char', 'spacer_annotate': True, 'spacer_new': True}"
src_vocab: vocab_char/src_vocab
tgt_vocab: vocab_char/tgt_vocab

# Corpus opts:
data:
    shard0:
        path_src: src_train
        path_tgt: tgt_train
        transforms: [onmt_tokenize,filtertoolong]
        weight: 1

    valid:
        path_src: src_val
        path_tgt: tgt_val
        transforms: [onmt_tokenize,filtertoolong]

#### Filter
src_seq_length: 200
tgt_seq_length: 128

Yes, the padding token index is passed to ignore_index in the criterion:
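
For illustration, here is a minimal standalone PyTorch sketch of that mechanism (the padding index, shapes, and values are made up for the example; it is not the OpenNMT-py source). Target positions equal to the padding index contribute nothing to the loss:

import torch
import torch.nn as nn

padding_idx = 1  # hypothetical pad token id
criterion = nn.NLLLoss(ignore_index=padding_idx, reduction="sum")

vocab_size = 5
log_probs = torch.log_softmax(torch.randn(4, vocab_size), dim=-1)  # 4 target positions
target = torch.tensor([3, 2, padding_idx, padding_idx])            # last two are padding

loss = criterion(log_probs, target)

# Equivalent manual masking: only the non-pad positions are summed.
nonpad = target != padding_idx
manual = -log_probs[nonpad].gather(1, target[nonpad].unsqueeze(1)).sum()
assert torch.allclose(loss, manual)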

Is everything else comparable between your TensorFlow run and the OpenNMT-py one? Model size, batch size, batching method, etc.?

Also, to check that your tokenization is applied properly, you can set the n_sample option in your config to dump that many examples with the transforms applied.

My bad, I didn’t realize that dropout isn’t applied when using a single-layer LSTM. The strangely high accuracy was due to overfitting.
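
For anyone else hitting this: PyTorch’s nn.LSTM only applies its dropout argument between stacked recurrent layers, so with num_layers=1 it has no effect. A minimal sketch (the sizes here are arbitrary, and I’m assuming the RNN encoder/decoder wrap nn.LSTM):

import torch.nn as nn

# `dropout` only inserts dropout between stacked layers, so with num_layers=1
# it is never applied; recent PyTorch versions emit a UserWarning about this.
lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=1, dropout=0.3)

To regularize a single-layer model, an explicit nn.Dropout on the embeddings or outputs would be needed instead.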