Model diverged with loss = NaN with ParallelInputter

Ingvar-Y · July 11, 2022, 9:39pm

Hi, I want to use opennmt-tf to train a SequenceClassifier on sequences of sequences of 120 characters, where each sequence also has some additional numeric properties. Since these properties just provide additional numerical information about a sequence, I thought it was a good idea to concatenate them with the sequence embedding, which would then give a numeric representation of the sequence. I defined the embedder as follows:

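from opennmt import encoders, inputters, models
from opennmt.layers import reducer

# Character-level embedding combined with 16-dimensional numeric vectors
# read from a record file; the two outputs are concatenated.
embedder = inputters.ParallelInputter(
    [
        inputters.CharRNNEmbedder(embedding_size=3, num_units=24),
        inputters.SequenceRecordInputter(16),
    ],
    reducer=reducer.ConcatReducer(),
)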

and then defined the whole model as follows:

model = models.SequenceClassifier(
    embedder,
    encoders.SelfAttentionEncoder(
        num_layers=2, num_units=16, num_heads=2, ffn_inner_dim=32
    ),
)

I used the following simple config for this model:

model_dir: data/checkpoints/
data:
  train_features_file: 
    - 'data/train_features_data.txt'
    - 'data/train_features_records.records'
  train_labels_file: 'data/train_labels.txt'
  source_1_vocabulary: 'data/features_vocab.txt'
  target_vocabulary: 'data/labels_vocab.txt'
params:
  optimizer: Adam
  learning_rate: 0.001
train:
  batch_size: 16
  max_step: 25
  save_checkpoints_steps: 5
  keep_checkpoint_max: 25
  save_summary_steps: 1

However, when training this model with Runner, I found that after just 5 steps the training fails with RuntimeError: Model diverged with loss = NaN. The problem does not occur if I remove ParallelInputter and SequenceRecordInputter altogether, and if I increase the number of numeric properties to 18, the model fails on the first step. Am I missing something in the training setup or model structure that causes this? The model logs look pretty normal before the problem happens:

INFO:tensorflow:Training on 1965182 examples
INFO:tensorflow:Number of model parameters: 8432
INFO:tensorflow:Number of model weights: 40 (trainable = 40, non trainable = 0)
INFO:tensorflow:Step = 1 ; steps/s = 0.00 ; Learning rate = 0.001000 ; Loss = 0.267733
INFO:tensorflow:Saved checkpoint data/checkpoints/ckpt-1
INFO:tensorflow:Step = 2 ; steps/s = 0.00 ; Learning rate = 0.001000 ; Loss = 0.242620
INFO:tensorflow:Step = 3 ; steps/s = 9.71 ; Learning rate = 0.001000 ; Loss = 0.153637
INFO:tensorflow:Step = 4 ; steps/s = 10.30 ; Learning rate = 0.001000 ; Loss = 0.071665
INFO:tensorflow:Step = 5 ; steps/s = 14.10 ; Learning rate = 0.001000 ; Loss = 0.096159
INFO:tensorflow:Saved checkpoint data/checkpoints/ckpt-5

guillaumekln · July 15, 2022, 3:33pm

Hi,

What are these numerical properties? Can you give an example?

Ingvar-Y · July 17, 2022, 8:56am

Hi!

Numerical features are either binary or arctan(integer).
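As an illustration of the arctan(integer) features mentioned above, here is a minimal sketch. The helper name `squash` is hypothetical and not taken from the actual preprocessing. It only shows that arctan keeps every finite input strictly inside (-pi/2, pi/2), and that a value already encoded as NaN stays NaN after the transform.

```python
import math


def squash(value):
    """Map a raw numeric property to a bounded range with arctan (hypothetical helper)."""
    return math.atan(value)


# Finite inputs, however large, stay strictly inside (-pi/2, pi/2).
bounded = [squash(raw) for raw in (0, 1, 10, 10**6, -(10**6))]
all_bounded = all(-math.pi / 2 < v < math.pi / 2 for v in bounded)

# A value that is already NaN is not repaired by the transform.
nan_result = squash(float("nan"))

print(all_bounded, math.isnan(nan_result))
```

So the magnitude of such features should not be a concern on its own, but a bounded transform does not protect against missing values.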

Ingvar-Y · July 17, 2022, 9:01am

Here is a small subsample:
https://drive.google.com/file/d/1i9XkDp6k8Z3kFLbOXOVw_xhRYICm8qhx/view?usp=sharing

guillaumekln · July 18, 2022, 9:35am

You can try applying a linear transformation to your numerical features so that the model can learn a representation that is consistent with the first feature.

You can subclass SequenceRecordInputter:

import opennmt
import tensorflow as tf

class MySequenceRecordInputter(opennmt.inputters.SequenceRecordInputter):
    def __init__(self, input_depth, num_units, **kwargs):
        super().__init__(input_depth, **kwargs)
        self.linear = tf.keras.layers.Dense(num_units)

    def call(self, features, training=None):
        inputs = super().call(features, training=training)
        return self.linear(inputs)

Ingvar-Y · July 18, 2022, 3:39pm

I just tried that and it doesn’t seem to help.

guillaumekln · July 18, 2022, 3:59pm

It seems there are NaN values in your input records. Can you check that?

import tensorflow as tf

dataset = tf.data.TFRecordDataset("subsample/features_data_18.records")

for i, element in enumerate(dataset):
    _, feature_lists, lengths = tf.io.parse_sequence_example(
        element,
        sequence_features={
            "values": tf.io.FixedLenSequenceFeature(
                [18], dtype=tf.float32
            )
        }
    )

    tf.debugging.assert_all_finite(feature_lists["values"], "NaN or Inf in record %d" % i)

This code raises the error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN or Inf in record 21 : Tensor had NaN values [Op:CheckNumerics]
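To see why a single bad record is enough to break training, here is a small pure-Python sketch of NaN propagation. It is not OpenNMT-tf code: `dense` stands in for one output unit of a linear layer, and the squared output is only a placeholder for a per-example loss. All names and numbers are made up for illustration.

```python
import math


def dense(inputs, weights, bias):
    """One output unit of a linear layer: sum(x_i * w_i) + b."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def mean(values):
    return sum(values) / len(values)


clean = [0.0, 1.0, 0.78]
dirty = [0.0, float("nan"), 0.78]  # one missing value in one example

clean_out = dense(clean, [0.5, -0.25, 0.1], 0.0)
dirty_out = dense(dirty, [0.5, -0.25, 0.1], 0.0)

# Even a zero weight on the bad feature does not mask it: nan * 0.0 is nan.
masked_out = dense(dirty, [0.5, 0.0, 0.1], 0.0)

# The batch loss is an average, so one NaN example poisons the whole batch.
batch_loss = mean([clean_out**2, dirty_out**2])

print(clean_out, dirty_out, masked_out, batch_loss)
```

Because NaN multiplied by any weight is still NaN, an extra linear layer on top of the numeric features cannot learn to ignore the bad value, which is consistent with the Dense projection not helping here.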

Ingvar-Y · July 19, 2022, 3:06pm

It seems that was the problem. It was a rather stupid mistake on my part, all things considered: I missed a step, for only some of the features, that guaranteed no NaN values, and I didn’t check for it afterwards. I still don’t understand why errors never showed up while using SequenceRecordInputter only, but I guess I should have checked the inputs earlier.
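For anyone hitting the same issue, a check like the following could run in preprocessing, before the feature vectors are written to the record file. This is only a sketch under assumptions: the features are taken to be plain lists of floats, and the helper names `find_non_finite` and `check_features` are hypothetical.

```python
import math


def find_non_finite(rows):
    """Return (row_index, column_index) pairs whose value is NaN or infinite."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if not math.isfinite(value)
    ]


def check_features(rows):
    """Raise before any record is written if a non-finite value is present."""
    bad = find_non_finite(rows)
    if bad:
        raise ValueError("Non-finite feature values at (row, column): %s" % bad[:10])
    return rows


# Toy data: the second row has a missing value in column 1.
features = [
    [1.0, 0.0, math.atan(3)],
    [0.0, float("nan"), math.atan(7)],
]
print(find_non_finite(features))  # [(1, 1)]
```

Reporting the column index as well as the row makes it easier to spot which features skipped the cleaning step.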