Does OpenNMT-py ignore pad tokens when calculating the loss? I couldn't find anything about this in the documentation.
I'm working on a character-level seq2seq model, and the training-set accuracy reaches ~90% after a few hundred steps. However, the translations are very poor. I tried to replicate this behavior with my own TensorFlow code, and the only way I could reproduce it was by turning off loss masking.
I’m tokenizing on-the-fly, if that makes any difference:
```yaml
save_model: model/char
save_data: model/objects

# Specific arguments for pyonmttok
# Char level, space is separate token
src_onmttok_kwargs: "{'mode': 'char', 'spacer_annotate': True, 'spacer_new': True}"
tgt_onmttok_kwargs: "{'mode': 'char', 'spacer_annotate': True, 'spacer_new': True}"

src_vocab: vocab_char/src_vocab
tgt_vocab: vocab_char/tgt_vocab

# Corpus opts:
data:
    shard0:
        path_src: src_train
        path_tgt: tgt_train
        transforms: [onmt_tokenize, filtertoolong]
        weight: 1
    valid:
        path_src: src_val
        path_tgt: tgt_val
        transforms: [onmt_tokenize, filtertoolong]

#### Filter
src_seq_length: 200
tgt_seq_length: 128
```
Yes, the padding token index is passed as `ignore_index` to the criterion.
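A minimal sketch of the mechanism with PyTorch's `CrossEntropyLoss` (the `PAD_IDX` value here is illustrative; OpenNMT-py passes the pad index from its own vocabulary):

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # hypothetical pad index for this sketch

# Positions whose target equals ignore_index contribute nothing to the loss.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX, reduction="sum")

logits = torch.randn(5, 10)  # 5 target positions, vocab size 10
targets = torch.tensor([4, 2, PAD_IDX, PAD_IDX, 7])

loss_masked = criterion(logits, targets)
loss_all = nn.CrossEntropyLoss(reduction="sum")(logits, targets)

# The masked loss equals the sum over only the non-pad positions.
non_pad = targets != PAD_IDX
per_token = nn.CrossEntropyLoss(reduction="none")(logits, targets)
assert torch.allclose(loss_masked, per_token[non_pad].sum())
```

Without masking, the pad positions would add their own (always non-negative) cross-entropy terms, which is consistent with the inflated accuracy you saw when you disabled masking in your TensorFlow run.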
Is everything else comparable between your TensorFlow run and the OpenNMT-py one? Model size, batch size, batching method, etc.?
Also, to check that your tokenization is properly applied, you can set the n_sample option in your config to dump n_sample examples with the transforms applied.
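For example, adding this to the config above (the value 10 is arbitrary) should dump a handful of transformed examples alongside your `save_data` path so you can inspect what the char-level tokenizer actually produces:

```yaml
# dump 10 examples with all transforms applied, for inspection
n_sample: 10
```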