Checkpoint averaging doesn't save the final model

fiqas · February 5, 2018, 6:47pm

I’m trying to average the last 3 checkpoints:

python -m bin.average_checkpoints --model_dir …/models/tensorflow/baseline --output_dir final_ensemble --max_count 3

(…)
INFO:tensorflow:Saving averaged checkpoint to final_ensemble/model.ckpt-80000

But the final model is not here:

-rw-r--r-- 1 user users 89 Feb 5 18:39 checkpoint
-rw-r--r-- 1 user users 321M Feb 5 18:39 model.ckpt-80000.data-00000-of-00001
-rw-r--r-- 1 user users 811 Feb 5 18:39 model.ckpt-80000.index
-rw-r--r-- 1 user users 67K Feb 5 18:39 model.ckpt-80000.meta

The .data file cannot be read for inference; it fails with this error:

DataLossError (see above for traceback): Unable to open table file …/models/tensorflow/baseline/final_ensemble/model.ckpt-80000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

EDIT: It turns out the model wasn’t saving ANY checkpoints during the whole training.
Each checkpoint has only these types of files:

-rw-r--r-- 1 user users 28 Feb 5 10:00 model.ckpt-43254.data-00000-of-00002
-rw-r--r-- 1 user users 321M Feb 5 10:00 model.ckpt-43254.data-00001-of-00002
-rw-r--r-- 1 user users 825 Feb 5 10:00 model.ckpt-43254.index
-rw-r--r-- 1 user users 1.2M Feb 5 10:00 model.ckpt-43254.meta

No model checkpoint whatsoever. This is what my config looks like for saving:

train:
  batch_size: 64
  bucket_width: 1
  save_checkpoints_steps: 5000
  save_summary_steps: 30
  train_steps: 80000
  maximum_features_length: 50
  maximum_labels_length: 50
  sample_buffer_size: 1000000 # Consider setting this to the training dataset size.
  keep_checkpoint_max: 10
  clip_gradients: 5.0

guillaumekln · February 5, 2018, 7:43pm

The checkpoint is not missing; it is made up of the 3 kinds of files ending with .data-*, .index, and .meta.

If you want to set the --checkpoint_path option of the main script, simply assign the common prefix, e.g.:

python -m bin.main infer [...] --checkpoint_path model.ckpt-43254
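
To make the "common prefix" idea concrete, here is a small sketch in plain Python (no TensorFlow required) that groups a directory listing into checkpoint prefixes. The helper name `checkpoint_prefixes` and its regular expression are illustrative assumptions, not part of OpenNMT-tf; the file names are taken from the listings earlier in this thread.

```python
import re
from collections import defaultdict

# Suffixes of the files that together form one checkpoint:
# ".index", ".meta", or ".data-<shard>-of-<total>" (one or more shards).
_SUFFIX = re.compile(r"^(?P<prefix>.+)\.(?:index|meta|data-\d+-of-\d+)$")


def checkpoint_prefixes(filenames):
    """Map each checkpoint prefix to the sorted files that share it.

    Hypothetical helper, for illustration only.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = _SUFFIX.match(name)
        if match:  # skips e.g. the plain-text "checkpoint" bookkeeping file
            groups[match.group("prefix")].append(name)
    return {prefix: sorted(files) for prefix, files in groups.items()}


listing = [
    "checkpoint",
    "model.ckpt-43254.data-00000-of-00002",
    "model.ckpt-43254.data-00001-of-00002",
    "model.ckpt-43254.index",
    "model.ckpt-43254.meta",
]
prefixes = checkpoint_prefixes(listing)
print(sorted(prefixes))  # candidate values for --checkpoint_path
```

The prefix printed here is what goes after --checkpoint_path, prepended with the directory if the files live elsewhere; none of the individual file names should be passed.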

fiqas · February 5, 2018, 8:04pm

It still doesn’t work.

The model has finished training on its own.

I average the last 3 checkpoints:

python -m bin.average_checkpoints --model_dir …/models/tensorflow/baseline --output_dir …/models/tensorflow/baseline/final_ensemble --max_count 3

INFO:tensorflow:Saving averaged checkpoint to …/models/tensorflow/baseline/final_ensemble/model.ckpt-80000

Now, I’m trying to translate with it:

python -m bin.main infer --config config/baseline.yml --features_file …/data/newstest2016.bpe.tr --predictions_file …/models/tensorflow/baseline/newstest2016.bpe.tr.translated --checkpoint_path …/models/tensorflow/baseline/final_ensemble/model.ckpt-80000

INFO:tensorflow:Restoring parameters from …/models/tensorflow/baseline/final_ensemble/model.ckpt-80000
2018-02-05 20:01:06.120897: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: End of sequence
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,?], [?]], output_types=[DT_INT64, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Could this be related to https://github.com/tensorflow/tensorflow/issues/12414 ?

guillaumekln · February 5, 2018, 8:08pm

It seems it worked as it should. Does the file …/models/tensorflow/baseline/newstest2016.bpe.tr.translated contain your translations?

This message:

2018-02-05 20:01:06.120897: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: End of sequence
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,?], [?]], output_types=[DT_INT64, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

is actually not an error and means the iteration over the input file finished. I think it is now silenced in TensorFlow 1.5.0 (?).
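
As a rough analogy in plain Python (deliberately not TensorFlow code), the end-of-sequence signal plays the same role as `StopIteration`: it tells the consuming loop that the input is exhausted, and the loop ends normally. The function names and the upper-casing stand-in for translation are made up for this sketch.

```python
def read_batches(lines, batch_size):
    """Yield successive batches until the input is exhausted."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


def translate_all(batches):
    """Consume every batch; running out of input ends the loop normally."""
    outputs = []
    iterator = iter(batches)
    while True:
        try:
            batch = next(iterator)
        except StopIteration:
            # Counterpart of "Out of range: End of sequence": no more input.
            break
        # Stand-in for the real translation step (assumption for the sketch).
        outputs.extend(line.upper() for line in batch)
    return outputs


result = translate_all(read_batches(["a", "b", "c"], batch_size=2))
print(result)
```

All three input lines are processed before the loop stops, which is why checking the predictions file is the right way to confirm that inference succeeded.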