Training loss spikes often, is this normal?

xun468 · July 17, 2018, 3:30pm

Hey, this is my first time using TensorFlow directly, so apologies if this is very obvious. I have defined a model that is similar to one I made in Keras. The hyperparameters are the same and I am using the Adam optimizer for both. I have noticed that when training with OpenNMT there are very large spikes in the loss, with one of the biggest I’ve noticed being from 4 to 15 in 50 steps. While there were spikes when training with Keras, they were never as large as the ones I’ve seen in OpenNMT. Is this behavior normal?

guillaumekln · July 17, 2018, 4:39pm

Hello,

Can you give details about your model and training configuration?

xun468 · July 17, 2018, 5:12pm

Yes, my model is

import opennmt as onmt
import tensorflow as tf

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="source_words_vocabulary",
          embedding_size=256,
          dropout=0.5,
          embedding_file_with_header=False),

      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=256),

      encoder=onmt.encoders.rnn_encoder.RNMTPlusEncoder(
          num_layers=2,
          num_units=150,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.5),

      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=2,
          num_units=300,
          bridge=onmt.layers.bridge.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.5,
          residual_connections=False))

And my yml is

model_dir: modelstuff

data:
  train_features_file: data/train_in.txt
  train_labels_file: data/train_tgt.txt

  eval_features_file: data/test_in.txt
  eval_labels_file: data/test_tgt.txt

  source_words_vocabulary: data/inputvocab.txt
  target_words_vocabulary: data/outputvocab.txt

params:
  optimizer: AdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.999
  learning_rate: 0.001

train:
  batch_size: 64
  bucket_width: 1
  save_checkpoints_steps: 5000
  save_summary_steps: 100
  train_steps: 15000
  maximum_features_length: 50
  maximum_labels_length: 50

  sample_buffer_size: -1

eval:
  eval_delay: 18000 # Every 5 hours.

infer:
  batch_size: 30

Sorry for the lack of formatting, I am not sure how to add it here.

guillaumekln · July 17, 2018, 7:02pm

Are you monitoring the loss in the training logs or with TensorBoard?

xun468 · July 17, 2018, 7:11pm

I am monitoring it through the training logs since I am running the model on a remote server.

guillaumekln · July 17, 2018, 7:29pm

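So the observation is expected, although it is indeed surprising. By default, the loss is normalized only across the batch dimension, so its value depends on the sequence length: a batch of longer sentences reports a larger loss even when the per-token loss is unchanged.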

There are several ways to work around that:

- Monitor the training loss with TensorBoard, which only reports the token-level loss.
- Or monitor only the evaluation loss.
- Train with the token-level loss by adding the following in the training configuration:

params:
  average_loss_in_time: true
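
To make the length dependence concrete, here is a minimal sketch in plain Python. It is not OpenNMT-tf code: the function names and the per-token loss value of 0.5 are made up for illustration, and the only thing taken from the thread is the idea of dividing by the batch size versus dividing by the number of tokens.

```python
# Illustrative only: the per-token losses below are invented, not produced by OpenNMT-tf.

def batch_normalized_loss(token_losses):
    """Sum every token loss in the batch, then divide by the number of sequences."""
    total = sum(sum(seq) for seq in token_losses)
    return total / len(token_losses)

def token_normalized_loss(token_losses):
    """Sum every token loss in the batch, then divide by the number of tokens."""
    total = sum(sum(seq) for seq in token_losses)
    num_tokens = sum(len(seq) for seq in token_losses)
    return total / num_tokens

# Two hypothetical batches of 64 sequences with the same per-token loss (0.5)
# but different sequence lengths (8 tokens versus 30 tokens).
short_batch = [[0.5] * 8 for _ in range(64)]
long_batch = [[0.5] * 30 for _ in range(64)]

for name, batch in [("short", short_batch), ("long", long_batch)]:
    print(name, batch_normalized_loss(batch), token_normalized_loss(batch))
# short 4.0 0.5
# long 15.0 0.5
```

Under these assumed numbers the batch-normalized value jumps from 4 to 15 purely because the sequences got longer, while the token-normalized value stays at 0.5. With a small bucket_width, each batch tends to contain sequences of similar length, so consecutive batches can differ a lot in this way.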

xun468 · July 17, 2018, 7:36pm

Thank you! I am curious then, what loss is OpenNMT training on (and displaying I assume)?

guillaumekln · July 17, 2018, 7:48pm

By default, it’s the standard softmax cross entropy divided by the batch size.
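
As a rough illustration of that definition, the sketch below computes a softmax cross entropy per target token, sums it over the whole batch, and divides by the batch size. It is written in plain Python with hypothetical function names, it ignores padding (a real implementation would have to mask padded positions), and it is not the library's actual code.

```python
import math

def softmax_cross_entropy(logits, target_index):
    """Cross entropy of one token: -log(softmax(logits)[target_index])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]

def loss_divided_by_batch_size(batch_logits, batch_targets):
    """Sum the token-level cross entropy over the batch, then divide by the batch size."""
    total = 0.0
    for seq_logits, seq_targets in zip(batch_logits, batch_targets):
        for logits, target in zip(seq_logits, seq_targets):
            total += softmax_cross_entropy(logits, target)
    return total / len(batch_logits)

# Hypothetical batch: 2 sequences (3 tokens and 1 token), vocabulary of 4 words,
# uniform logits so every token contributes exactly log(4).
uniform = [0.0, 0.0, 0.0, 0.0]
batch_logits = [[uniform, uniform, uniform], [uniform]]
batch_targets = [[0, 1, 2], [3]]

loss = loss_divided_by_batch_size(batch_logits, batch_targets)
print(loss, 2 * math.log(4))  # 4 tokens * log(4) / 2 sequences = 2 * log(4)
```

Because the sum runs over tokens but the divisor counts sequences, the result grows with the average sequence length in the batch.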