Hi, I want to use opennmt-tf to train a SequenceClassifier on sequences of sequences of 120 characters, where each sequence also has some additional numeric properties. Since these properties just provide additional numerical information about a sequence, I thought it would be a good idea to concatenate them with the sequence embedding, which would then give a numeric representation of the sequence. I defined the embedder as follows:
embedder = inputters.ParallelInputter(
    [
        inputters.CharRNNEmbedder(embedding_size=3, num_units=24),
        inputters.SequenceRecordInputter(16),
    ],
    reducer=reducer.ConcatReducer(),
)
and then defined the whole model as follows:
model = models.SequenceClassifier(
    embedder,
    encoders.SelfAttentionEncoder(
        num_layers=2, num_units=16, num_heads=2, ffn_inner_dim=32
    ),
)
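To make the intended shapes concrete, here is a NumPy-only sketch of the concatenation I have in mind. It does not use opennmt-tf at all: the arrays are random stand-ins for the two inputter outputs, and the sizes (5 tokens, 24-dimensional character embedding, 16 numeric properties per token) are assumptions for illustration.

```python
import numpy as np

# Hypothetical sizes: T tokens per example, 24-dim char-RNN output,
# 16 numeric properties per token.
T, CHAR_DIM, RECORD_DIM = 5, 24, 16

rng = np.random.default_rng(0)
# Stand-in for what the CharRNNEmbedder would produce (one vector per token).
char_embeddings = rng.normal(size=(T, CHAR_DIM))
# Stand-in for what the SequenceRecordInputter would read (one vector per token).
numeric_props = rng.normal(size=(T, RECORD_DIM))

# Concatenate along the depth (last) axis: one combined vector per token.
combined = np.concatenate([char_embeddings, numeric_props], axis=-1)
print(combined.shape)  # (5, 40)
```

So with these sizes I expect the encoder to receive 40-dimensional inputs per token.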
I used the following simple config for this model:
model_dir: data/checkpoints/
data:
  train_features_file:
    - 'data/train_features_data.txt'
    - 'data/train_features_records.records'
  train_labels_file: 'data/train_labels.txt'
  source_1_vocabulary: 'data/features_vocab.txt'
  target_vocabulary: 'data/labels_vocab.txt'
params:
  optimizer: Adam
  learning_rate: 0.001
train:
  batch_size: 16
  max_step: 25
  save_checkpoints_steps: 5
  keep_checkpoint_max: 25
  save_summary_steps: 1
However, when training this model with Runner, I found that after just 5 steps the training fails with RuntimeError: Model diverged with loss = NaN. The problem does not occur if I remove ParallelInputter and SequenceRecordInputter altogether, and if I increase the number of numeric properties to 18, the model fails on the first step. Am I missing something in the training setup or model structure that causes this? The model logs look normal before the failure:
INFO:tensorflow:Training on 1965182 examples
INFO:tensorflow:Number of model parameters: 8432
INFO:tensorflow:Number of model weights: 40 (trainable = 40, non trainable = 0)
INFO:tensorflow:Step = 1 ; steps/s = 0.00 ; Learning rate = 0.001000 ; Loss = 0.267733
INFO:tensorflow:Saved checkpoint data/checkpoints/ckpt-1
INFO:tensorflow:Step = 2 ; steps/s = 0.00 ; Learning rate = 0.001000 ; Loss = 0.242620
INFO:tensorflow:Step = 3 ; steps/s = 9.71 ; Learning rate = 0.001000 ; Loss = 0.153637
INFO:tensorflow:Step = 4 ; steps/s = 10.30 ; Learning rate = 0.001000 ; Loss = 0.071665
INFO:tensorflow:Step = 5 ; steps/s = 14.10 ; Learning rate = 0.001000 ; Loss = 0.096159
INFO:tensorflow:Saved checkpoint data/checkpoints/ckpt-5
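In case it helps with diagnosis, below is a sketch of a check I can run on the numeric vectors before they are written to the records file. I do not know that bad or very large input values are the cause; this is only an assumption I would like to rule out. The helper name summarize_vectors and the toy arrays are hypothetical, and the sketch uses only NumPy rather than any opennmt-tf call.

```python
import numpy as np

def summarize_vectors(vectors):
    """Count non-finite entries and report the finite value range
    over a list of [T, depth] arrays."""
    n_bad = 0
    lo, hi = np.inf, -np.inf
    for v in vectors:
        v = np.asarray(v, dtype=np.float32)
        finite = np.isfinite(v)
        n_bad += int((~finite).sum())
        if finite.any():
            lo = min(lo, float(v[finite].min()))
            hi = max(hi, float(v[finite].max()))
    return {"non_finite": n_bad, "min": lo, "max": hi}

# Toy data standing in for the per-example numeric property matrices.
examples = [
    np.array([[0.5, 1.0], [2.0, -3.0]]),
    np.array([[np.nan, 1e6], [0.0, np.inf]]),
]
report = summarize_vectors(examples)
print(report)
```

A non-zero non_finite count, or a min/max range far from the scale of the embeddings, would at least tell me whether the inputs themselves are suspect.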