Worse training result than OpenNMT-py

HaoTruong · June 9, 2019, 11:48am

Hi,
I’m very interested in some features that only the TensorFlow version of OpenNMT has, so I’m experimenting with it. I was previously using OpenNMT-py, and I tried to set up the model in OpenNMT-tf as close to the -py version as possible.

I’m training a Chinese-Vietnamese model with 32k sentence pairs in the training set and about 2k sentence pairs each in the dev and test sets.

Sadly, when I use the default Rnnsmall autoconfig settings, the model overfits very quickly. I tried changing the params and using a custom model, with no success.

Whatever I try, the evaluation loss slowly increases, sometimes starting at the second evaluation.

This is the best run I have gotten so far. I used a BiRNN model, which reached a BLEU score of 30.87 at step 65k.
[Image: loss curves]
After step 15k, the eval loss slowly increased, so I eventually stopped training.

This is the config I used:

model_dir: model

data:
  eval_features_file: cv.dev.cn
  eval_labels_file: cv.dev.vn
  source_words_vocabulary: src-vocab.txt
  target_words_vocabulary: tgt-vocab.txt
  train_features_file: cv.train.cn
  train_labels_file: cv.train.vn
eval:
  batch_size: 32
  eval_delay: 0
  exporters: last
infer:
  batch_size: 32
  bucket_width: 0
params:
  average_loss_in_time: true
  beam_width: 5
  learning_rate: 2.0 # The scale constant.
  clip_gradients: null
  decay_step_duration: 8 # 1 decay step is 8 training steps.
  decay_type: noam_decay_v2
  decay_params:
    model_dim: 512
    warmup_steps: 2000 # (= 16000 training steps).
  start_decay_steps: 0
  label_smoothing: 0.1
  length_penalty: 0
  gradients_accum: 1
  optimizer: AdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 4096
  batch_type: tokens
  keep_checkpoint_max: 8
  maximum_features_length: 50
  maximum_labels_length: 50
  sample_buffer_size: -1
  save_checkpoints_steps: 5000
  save_summary_steps: 100
  train_steps: 200000

This is my custom model:

import opennmt as onmt
import tensorflow as tf

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="source_words_vocabulary",
          embedding_size=512,
          dtype=tf.float16
          ),
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=512,
          dtype=tf.float16
          ),
      encoder=onmt.encoders.BidirectionalRNNEncoder(
          num_layers=2,
          num_units=500,
          reducer=onmt.layers.ConcatReducer(),
          cell_class=tf.nn.rnn_cell.LSTMCell,
          dropout=0.2,
          residual_connections=False
          ),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=2,
          num_units=500,
          bridge=onmt.layers.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.2,
          residual_connections=False))

By comparison, with OpenNMT-py I got a BLEU score of 37.83 using the following command:

!CUDA_VISIBLE_DEVICES=0 python train.py -data data/test.atok.low -save_model demo_model -gpu_ranks 0 -optim adam -learning_rate 0.001 -encoder_type brnn \
  -dropout 0.2 -word_vec_size 512 -train_steps 200000\
  -batch_type tokens -normalization tokens \
  -label_smoothing 0.1

I hope I can get some help. Why does the TensorFlow model overfit so easily, and why is its translation quality worse than the -py version, even though I tried to set the two up as closely as possible?

guillaumekln · June 9, 2019, 12:22pm

Hi,

Here are some differences:

In -py you used a fixed learning rate of 0.001, while in -tf you use the Noam schedule.
In -py you set -batch_type tokens with the default batch size of 64, while in -tf you set this value to 4096.
In -tf you are training the model in FP16.
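
To make the first difference concrete, here is a rough sketch of what the Noam schedule in the -tf config works out to compared with the fixed 0.001 used in -py. It assumes the standard Noam formula from the Transformer paper and assumes that decay_step_duration simply divides the training step. The function name noam_learning_rate and the +1 offset are illustrative choices, not taken from the OpenNMT-tf source, so the exact values inside the library may differ slightly.

```python
def noam_learning_rate(training_step, scale=2.0, model_dim=512,
                       warmup_steps=2000, decay_step_duration=8):
    """Noam-style learning rate (assumed formula, not OpenNMT-tf's exact code)."""
    # Assumption: one decay step spans `decay_step_duration` training steps.
    # The +1 avoids a zero step at the very start of training.
    decay_step = training_step // decay_step_duration + 1
    # Linear warmup up to `warmup_steps`, then inverse square root decay.
    return scale * model_dim ** -0.5 * min(
        decay_step ** -0.5, decay_step * warmup_steps ** -1.5)


FIXED_LR = 0.001  # the -learning_rate value passed to OpenNMT-py


def compare(steps=(8, 1000, 8000, 16000, 65000, 200000)):
    """Print the scheduled rate next to the fixed rate for a few steps."""
    for step in steps:
        lr = noam_learning_rate(step)
        print(f"step {step:>6}: noam lr = {lr:.6f} "
              f"({lr / FIXED_LR:.2f}x the fixed rate)")
```

Calling compare() prints the scheduled rate at a few training steps. Under these assumptions the rate ramps up until about training step 16000, peaks at roughly twice the fixed -py rate, and then decays, so the two runs are not optimized the same way at any point in training.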

So maybe try these parameters:

params:
  average_loss_in_time: true
  beam_width: 5
  label_smoothing: 0.1
  optimizer: AdamOptimizer
  learning_rate: 0.001
train:
  batch_size: 64
  batch_type: tokens
  maximum_features_length: 50
  maximum_labels_length: 50
  sample_buffer_size: -1
  save_checkpoints_steps: 5000
  train_steps: 200000

HaoTruong · June 9, 2019, 3:14pm

Hi, thanks for your quick reply.

I will retry with those new parameters and see if it makes any difference.

If I understand correctly, using FP16 improves training speed but doesn’t have much impact on model quality, right?

guillaumekln · June 9, 2019, 5:06pm

It still changes the way the model is trained (dynamic loss scaling). Also, I’m not sure it is currently optimized for RNN models.
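
For readers unfamiliar with dynamic loss scaling, here is a minimal, framework-free sketch of the general idea. It is a toy illustration only and not OpenNMT-tf's or TensorFlow's implementation: the class name DynamicLossScaler and its parameters are hypothetical. The loss is multiplied by a scale so that small FP16 gradients do not underflow. When a non-finite gradient appears, the update is skipped and the scale is reduced. After enough clean steps the scale is raised again.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler (illustrative; hypothetical names and defaults)."""

    def __init__(self, init_scale=2.0 ** 15, growth_interval=2000, factor=2.0):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.factor = factor
        self.good_steps = 0  # consecutive steps without overflow

    def unscale(self, scaled_grads):
        """Return unscaled gradients, or None if this step must be skipped."""
        if any(not math.isfinite(g) for g in scaled_grads):
            # Overflow: drop this update and shrink the scale.
            self.scale /= self.factor
            self.good_steps = 0
            return None
        # Divide by the scale that was used to produce these gradients.
        grads = [g / self.scale for g in scaled_grads]
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # Enough clean steps in a row: try a larger scale.
            self.scale *= self.factor
            self.good_steps = 0
        return grads
```

For example, with DynamicLossScaler(init_scale=8.0, growth_interval=2), passing [8.0, 16.0] returns [1.0, 2.0], while passing a gradient of float("inf") returns None and halves the scale to 4.0. Skipped updates are one way FP16 training can follow a different trajectory from an FP32 run with otherwise identical settings.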