Hey, this is my first time using TensorFlow directly, so apologies if this is very obvious. I have defined a model that is similar to one I previously built in Keras. The hyperparameters are the same and I am using the Adam optimizer for both. I have noticed that when training with OpenNMT there are very large spikes in the loss, one of the biggest being a jump from 4 to 15 within 50 steps. While there were spikes when training with Keras, they were never as large as the ones I’ve seen in OpenNMT. Is this behavior normal?
Can you give details about your model and training configuration?
Yes, my model is:

```python
return onmt.models.SequenceToSequence(
    source_inputter=onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_words_vocabulary",
        embedding_size=256,
        dropout=0.5,
        embedding_file_with_header=False),
    target_inputter=onmt.inputters.WordEmbedder(
        vocabulary_file_key="target_words_vocabulary",
        embedding_size=256),
    encoder=onmt.encoders.rnn_encoder.RNMTPlusEncoder(
        num_layers=2,
        num_units=150,
        cell_class=tf.contrib.rnn.LSTMCell,
        dropout=0.5),
    decoder=onmt.decoders.AttentionalRNNDecoder(
        num_layers=2,
        num_units=300,
        bridge=onmt.layers.bridge.CopyBridge(),
        attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
        cell_class=tf.contrib.rnn.LSTMCell,
        dropout=0.5,
        residual_connections=False))
```
And my yml is:

```yaml
eval_delay: 18000  # Every 5 hours.
```
Sorry for the lack of formatting; I am not sure how to add code blocks here.
Are you monitoring the loss in the training logs or with TensorBoard?
I am monitoring it through the training logs since I am running the model on a remote server.
The observation is expected, even if surprising at first. By default, the loss is only normalized across the batch dimension, so its value depends on the sequence lengths in the batch.
There are several ways to work around that:

- monitor the training loss with TensorBoard, which only reports the token-level loss;
- or monitor only the evaluation loss;
- or train directly on the token-level loss by adding the following to the training configuration:

```yaml
params:
  average_loss_in_time: true
```
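To illustrate why the two normalizations behave so differently, here is a minimal NumPy sketch. The per-token losses and padding mask are made up for illustration; this is not OpenNMT code, just the arithmetic behind the setting:

```python
import numpy as np

# Made-up per-token cross-entropy values for a batch of 2 sequences,
# padded to length 4; the mask marks real (non-padding) tokens.
token_loss = np.array([[2.0, 2.0, 2.0, 2.0],
                       [2.0, 2.0, 0.0, 0.0]])
mask = np.array([[1.0, 1.0, 1.0, 1.0],
                 [1.0, 1.0, 0.0, 0.0]])

total = (token_loss * mask).sum()  # sum over all real tokens

# Default normalization: divide by the batch size only.
# A batch with longer sequences accumulates more terms, so this
# value jumps around with sequence length even at constant quality.
loss_per_sequence = total / token_loss.shape[0]   # 12.0 / 2 = 6.0

# With average_loss_in_time: divide by the number of real tokens,
# which is comparable across batches of different lengths.
loss_per_token = total / mask.sum()               # 12.0 / 6 = 2.0
```

Both losses describe the same model quality; only the denominator differs, which is why batches of long sequences can make the default loss spike.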
Thank you! I am curious then, what loss is OpenNMT training on (and displaying I assume)?
By default, it’s the standard softmax cross entropy, summed over all tokens and divided by the batch size.
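A rough NumPy sketch of that computation, with random logits and labels just to show the shape of the formula. The `softmax_cross_entropy` helper here is my own, not an OpenNMT function:

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    # Numerically stable per-token cross entropy: subtract the max
    # before exponentiating, then pick out the log-prob of each label.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.take_along_axis(log_probs, labels[..., None], axis=-1).squeeze(-1)

# Batch of 2 sequences, length 3, vocabulary of 4 (illustrative values).
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 3, 4))
labels = rng.integers(0, 4, size=(2, 3))

per_token = softmax_cross_entropy(logits, labels)  # shape (2, 3)

# Default reported loss: sum over every token, divide by batch size.
loss = per_token.sum() / logits.shape[0]
```

With this normalization, the displayed loss grows with the number of tokens per sequence, which is consistent with the spikes described above.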