Training loss spikes often, is this normal?

tensorflow

(Xun468) #1

Hey, this is my first time using TensorFlow directly, so apologies if this is very obvious. I have defined a model that is similar to one I made in Keras. The hyperparameters are the same and I am using the Adam optimizer for both. I have noticed that when training with OpenNMT there are very large spikes in the loss, one of the biggest being a jump from 4 to 15 within 50 steps. There were spikes when training with Keras too, but they were never as large as the ones I've seen in OpenNMT. Is this behavior normal?


(Guillaume Klein) #2

Hello,

Can you give details about your model and training configuration?


(Xun468) #3

Yes, my model is

# tf.contrib is only available in TensorFlow 1.x.
import tensorflow as tf
import opennmt as onmt

def model():
  return onmt.models.SequenceToSequence(
      # Source embeddings, with dropout applied to the embedded inputs.
      source_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="source_words_vocabulary",
          embedding_size=256,
          dropout=0.5,
          embedding_file_with_header=False),

      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=256),

      encoder=onmt.encoders.rnn_encoder.RNMTPlusEncoder(
          num_layers=2,
          num_units=150,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.5),

      # The decoder is initialized with a copy of the encoder state (CopyBridge).
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=2,
          num_units=300,
          bridge=onmt.layers.bridge.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.5,
          residual_connections=False))

And my yml is

model_dir: modelstuff

data:
  train_features_file: data/train_in.txt
  train_labels_file: data/train_tgt.txt

  eval_features_file: data/test_in.txt
  eval_labels_file: data/test_tgt.txt

  source_words_vocabulary: data/inputvocab.txt
  target_words_vocabulary: data/outputvocab.txt

params:
  optimizer: AdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.999
  learning_rate: 0.001

train:
  batch_size: 64
  bucket_width: 1
  save_checkpoints_steps: 5000
  save_summary_steps: 100
  train_steps: 15000
  maximum_features_length: 50
  maximum_labels_length: 50

  sample_buffer_size: -1

eval:
  eval_delay: 18000 # Every 5 hours.

infer:
  batch_size: 30

Sorry for the lack of formatting, I am not sure how to add it here.


(Guillaume Klein) #4

Are you monitoring the loss in the training logs or with TensorBoard?


(Xun468) #5

I am monitoring it through the training logs since I am running the model on a remote server.


(Guillaume Klein) #6

So the observation is expected but indeed surprising. By default, the loss is only normalized across the batch dimension so its value depends on the sequence length.

There are several ways to work around that:

  • monitor the training loss with TensorBoard, which only reports the token-level loss
  • or monitor only the evaluation loss
  • or train with the token-level loss by adding the following to the training configuration:

params:
  average_loss_in_time: true

(Xun468) #7

Thank you! I am curious then, what loss is OpenNMT training on (and displaying I assume)?


(Guillaume Klein) #8

By default, it’s the standard softmax cross entropy divided by the batch size.
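
To make the difference between the two normalizations concrete, here is a minimal sketch in plain Python. It is not the OpenNMT-tf implementation: `softmax_cross_entropy`, `batch_losses`, and the toy batches are hypothetical names and data chosen for illustration, and every token is assumed to have the same uniform logits, so each token costs log(4). The loss divided by the batch size grows with the sequence length, while the token-level loss stays constant.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross entropy of one token: -log(softmax(logits)[target])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]

def batch_losses(batch):
    """Return (loss / batch size, loss / number of tokens).

    `batch` is a list of sequences; each sequence is a list of
    (logits, target) pairs, one pair per time step.
    """
    total = sum(softmax_cross_entropy(logits, target)
                for sequence in batch
                for logits, target in sequence)
    num_tokens = sum(len(sequence) for sequence in batch)
    return total / len(batch), total / num_tokens

# Hypothetical data: a vocabulary of 4 words with uniform logits.
token = ([0.0, 0.0, 0.0, 0.0], 0)
short_batch = [[token] * 5 for _ in range(2)]   # 2 sequences of 5 tokens
long_batch = [[token] * 50 for _ in range(2)]   # 2 sequences of 50 tokens

for name, batch in (("short", short_batch), ("long", long_batch)):
    per_batch, per_token = batch_losses(batch)
    print(f"{name}: per-batch={per_batch:.3f} per-token={per_token:.3f}")
# short: per-batch=6.931 per-token=1.386
# long: per-batch=69.315 per-token=1.386
```

In this toy setup the model is equally good on both batches, yet the batch-normalized value differs by a factor of 10. That is the kind of jump that can show up in the training logs when consecutive batches contain sequences of different lengths.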