TensorFlow. When is training complete?

opennmt-tf

(Peter Molevith) #1

Hi there
Sorry for this beginner question, but I’m not sure how to determine the progress of the training.

I’m training on a corpus of about 400,000 sentences using OpenNMT-tf (the TensorFlow version of OpenNMT).
I’m on Windows, on hardware without a GPU.

My questions are:

  1. When will training complete?
  2. Is it ok to test during training?
  3. Why are tests almost only showing useless output?

Training parameters:
python -m bin.main train_and_eval --model config/models/nmt_small.py --config config/opennmt-defaults.yml config/data/xxx.yml

I have created the vocabs for src and tgt.
I have not modified the default config files, except to point them to my own files.

When I test, I use:
python -m bin.main infer --config config/opennmt-defaults.yml config/data/xxx.yml --features_file data/xxx/src-test.txt
The output is 99% <unk> and </s>, so it is completely useless.
But this is while the model is still training.
Training has been running for 4 days.

Log tail from the current training:
INFO:tensorflow:Loss for final step: 60.82655.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-02-04:44:31
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from xxx\model.ckpt-39393
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-04-02-04:47:42
INFO:tensorflow:Saving dict for global step 39393: global_step = 39393, loss = 4.2886634
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['serving_default']
INFO:tensorflow:Restoring parameters from xxx\model.ckpt-39393
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets written to: b"xxx\export\latest\temp-b'1522644463'\assets"
INFO:tensorflow:SavedModel written to: b"xxx\export\latest\temp-b'1522644463'\saved_model.pb"
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Number of trainable parameters: 87083345
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from xxx\model.ckpt-39393
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 39394 into xxx\model.ckpt.
INFO:tensorflow:loss = 49.204033, step = 39394
INFO:tensorflow:global_step/sec: 0.119733
INFO:tensorflow:words_per_sec/features: 122.171
INFO:tensorflow:words_per_sec/labels: 120.3
INFO:tensorflow:global_step/sec: 0.134033
INFO:tensorflow:loss = 48.57039, step = 39494 (790.641 sec)
INFO:tensorflow:words_per_sec/features: 134.854
INFO:tensorflow:words_per_sec/labels: 132.298
INFO:tensorflow:global_step/sec: 0.11271
INFO:tensorflow:words_per_sec/features: 111.115
INFO:tensorflow:words_per_sec/labels: 108.798
INFO:tensorflow:global_step/sec: 0.112425
INFO:tensorflow:loss = 63.985336, step = 39594 (888.359 sec)
INFO:tensorflow:words_per_sec/features: 120.994
INFO:tensorflow:words_per_sec/labels: 116.269
INFO:tensorflow:global_step/sec: 0.133264
INFO:tensorflow:words_per_sec/features: 134.877
INFO:tensorflow:words_per_sec/labels: 134.162
INFO:tensorflow:global_step/sec: 0.12646
INFO:tensorflow:loss = 51.120525, step = 39694 (770.595 sec)
INFO:tensorflow:words_per_sec/features: 134.063
INFO:tensorflow:words_per_sec/labels: 130.707
INFO:tensorflow:global_step/sec: 0.131255
INFO:tensorflow:words_per_sec/features: 136.23
INFO:tensorflow:words_per_sec/labels: 132.752
INFO:tensorflow:global_step/sec: 0.131476
INFO:tensorflow:loss = 77.925156, step = 39794 (761.219 sec)
INFO:tensorflow:words_per_sec/features: 132.579
INFO:tensorflow:words_per_sec/labels: 131.904
INFO:tensorflow:global_step/sec: 0.1223
INFO:tensorflow:words_per_sec/features: 126.435
INFO:tensorflow:words_per_sec/labels: 123.761
INFO:tensorflow:global_step/sec: 0.128446
INFO:tensorflow:loss = 59.011623, step = 39894 (798.097 sec)
INFO:tensorflow:words_per_sec/features: 136.087
INFO:tensorflow:words_per_sec/labels: 132.545
INFO:tensorflow:global_step/sec: 0.101857
INFO:tensorflow:words_per_sec/features: 108.341
INFO:tensorflow:words_per_sec/labels: 107.155
INFO:tensorflow:global_step/sec: 0.112868
INFO:tensorflow:loss = 66.56572, step = 39994 (933.880 sec)
INFO:tensorflow:words_per_sec/features: 116.497
INFO:tensorflow:words_per_sec/labels: 114.39
INFO:tensorflow:global_step/sec: 0.113701
INFO:tensorflow:words_per_sec/features: 123.973
INFO:tensorflow:words_per_sec/labels: 122.798
INFO:tensorflow:global_step/sec: 0.124485
INFO:tensorflow:loss = 97.666565, step = 40094 (841.419 sec)
INFO:tensorflow:words_per_sec/features: 127.033
INFO:tensorflow:words_per_sec/labels: 125.911
INFO:tensorflow:global_step/sec: 0.125362
INFO:tensorflow:words_per_sec/features: 127.508
INFO:tensorflow:words_per_sec/labels: 126.811
INFO:tensorflow:global_step/sec: 0.117176
INFO:tensorflow:loss = 40.40155, step = 40194 (825.538 sec)
INFO:tensorflow:words_per_sec/features: 122.729
INFO:tensorflow:words_per_sec/labels: 118.677
INFO:tensorflow:global_step/sec: 0.128379
INFO:tensorflow:words_per_sec/features: 134.496
INFO:tensorflow:words_per_sec/labels: 130.503
INFO:tensorflow:global_step/sec: 0.106624
INFO:tensorflow:loss = 101.765274, step = 40294 (858.407 sec)
INFO:tensorflow:words_per_sec/features: 113.769
INFO:tensorflow:words_per_sec/labels: 109.035
INFO:tensorflow:global_step/sec: 0.125137
INFO:tensorflow:words_per_sec/features: 132.143
INFO:tensorflow:words_per_sec/labels: 130.938
INFO:tensorflow:Saving checkpoints for 40394 into xxx\model.ckpt.
INFO:tensorflow:global_step/sec: 0.113762
INFO:tensorflow:loss = 86.29198, step = 40394 (839.079 sec)
INFO:tensorflow:words_per_sec/features: 113.935
INFO:tensorflow:words_per_sec/labels: 113.077
INFO:tensorflow:global_step/sec: 0.128195
INFO:tensorflow:words_per_sec/features: 132.538
INFO:tensorflow:words_per_sec/labels: 131.333
INFO:tensorflow:global_step/sec: 0.108179
INFO:tensorflow:loss = 47.689987, step = 40494 (852.229 sec)
INFO:tensorflow:words_per_sec/features: 119.852
INFO:tensorflow:words_per_sec/labels: 117.083
INFO:tensorflow:global_step/sec: 0.125795
INFO:tensorflow:words_per_sec/features: 131.947
INFO:tensorflow:words_per_sec/labels: 131.202
INFO:tensorflow:global_step/sec: 0.114199
INFO:tensorflow:loss = 48.773567, step = 40594 (835.303 sec)
INFO:tensorflow:words_per_sec/features: 115.668
INFO:tensorflow:words_per_sec/labels: 114.811

Thank you in advance.

Best
Peter


(Guillaume Klein) #2

When will training complete?

With some experience, you know how many training steps are required to train your model (configured with the train_steps option). Otherwise, the train_and_eval run type periodically reports the loss on the validation dataset. You should monitor this value and consider training complete when it no longer decreases.
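
As a rough illustration of monitoring that value, the sketch below pulls the validation loss out of log lines of the form "Saving dict for global step N: global_step = N, loss = X" (the format shown in the log above) and applies a simple plateau check. The function names and the patience and min_delta thresholds are my own assumptions, not part of OpenNMT-tf, and the parser assumes each log entry sits on a single line, so console-wrapped lines need to be joined first.

```python
import re

# Evaluation summary lines look like:
# INFO:tensorflow:Saving dict for global step 39393: global_step = 39393, loss = 4.2886634
EVAL_LINE = re.compile(r"Saving dict for global step (\d+):.*?\bloss = ([0-9.eE+-]+)")


def validation_losses(log_text):
    """Return (step, validation_loss) pairs in the order they appear in the log."""
    return [(int(step), float(loss)) for step, loss in EVAL_LINE.findall(log_text)]


def has_plateaued(losses, patience=3, min_delta=0.01):
    """Heuristic (an assumption, not an OpenNMT-tf rule): report a plateau when
    none of the last `patience` evaluations beat the best earlier loss by at
    least `min_delta`."""
    if len(losses) <= patience:
        return False  # not enough evaluations to judge
    best_before = min(loss for _, loss in losses[:-patience])
    best_recent = min(loss for _, loss in losses[-patience:])
    return best_recent > best_before - min_delta


if __name__ == "__main__":
    sample = (
        "INFO:tensorflow:Saving dict for global step 39393: "
        "global_step = 39393, loss = 4.2886634\n"
        "INFO:tensorflow:loss = 49.204033, step = 39394\n"
    )
    history = validation_losses(sample)
    print(history)
    print("plateau:", has_plateaued(history))
```

Note that only the "Saving dict" lines carry the validation loss; the much larger "loss = ..., step = ..." values are per-batch training losses and fluctuate a lot, so they are a poor stopping signal.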

Is it ok to test during training?

If you have enough resources, of course you can start another instance with the infer run type.

Why are tests almost only showing useless output?

It could be that more training steps are required, or that the test file you used is unrelated to your training data (e.g. another language, domain, or tokenization, resulting in many out-of-vocabulary words).
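
One quick way to check the out-of-vocabulary possibility is to measure how many tokens of the test file are missing from the source vocabulary. This is only a sketch: it assumes the vocabulary file has one token per line and that the data is split on whitespace, and the function names and file paths are placeholders of mine rather than anything provided by OpenNMT-tf.

```python
def load_vocab(path):
    """Read a vocabulary file, assumed to contain one token per line."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def oov_rate(sentences, vocab):
    """Return the fraction of whitespace-separated tokens that are not in vocab."""
    total = 0
    unknown = 0
    for sentence in sentences:
        for token in sentence.split():
            total += 1
            if token not in vocab:
                unknown += 1
    return unknown / total if total else 0.0


if __name__ == "__main__":
    # Tiny inline example; replace with load_vocab("path/to/src-vocab.txt")
    # and the lines of your own test file (both paths are placeholders).
    vocab = {"the", "cat", "sat"}
    test_sentences = ["the cat sat", "the dog sat down"]
    print(f"OOV rate: {oov_rate(test_sentences, vocab):.1%}")
```

A high rate on the test file compared with the rate on a sample of the training data would suggest a mismatch in language, domain, or tokenization rather than a lack of training steps.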

Overall, working on this type of task without a GPU will be a long and painful process.
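
To put a number on that, the global_step/sec value in the training log gives a back-of-the-envelope estimate of the time remaining. The helper below is a hypothetical sketch; the target of 100000 steps is purely illustrative and should be replaced with whatever train_steps is set to in your configuration.

```python
def remaining_hours(current_step, target_steps, steps_per_sec):
    """Estimate the hours left to reach target_steps at a constant step rate."""
    if steps_per_sec <= 0:
        raise ValueError("steps_per_sec must be positive")
    remaining_steps = max(target_steps - current_step, 0)
    return remaining_steps / steps_per_sec / 3600.0


if __name__ == "__main__":
    # Step count and approximate rate taken from the log above;
    # the target step count is an illustrative assumption.
    hours = remaining_hours(current_step=40594, target_steps=100000, steps_per_sec=0.12)
    print(f"about {hours:.0f} hours ({hours / 24:.1f} days) remaining")
```

The estimate assumes the step rate stays constant and ignores the time spent on periodic evaluation and checkpointing.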