python -m bin.average_checkpoints --model_dir …/models/tensorflow/baseline --output_dir final_ensemble --max_count 3
(…)
INFO:tensorflow:Saving averaged checkpoint to final_ensemble/model.ckpt-80000
But the final model is not here:
-rw-r--r-- 1 user users 89 Feb 5 18:39 checkpoint
-rw-r--r-- 1 user users 321M Feb 5 18:39 model.ckpt-80000.data-00000-of-00001
-rw-r--r-- 1 user users 811 Feb 5 18:39 model.ckpt-80000.index
-rw-r--r-- 1 user users 67K Feb 5 18:39 model.ckpt-80000.meta
The highlighted file (model.ckpt-80000.data-00000-of-00001) cannot be read for inference; it fails with this error:
DataLossError (see above for traceback): Unable to open table file …/models/tensorflow/baseline/final_ensemble/model.ckpt-80000.data-00000-of-00001: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?
EDIT: It turns out the model wasn’t saving ANY checkpoints during the whole training.
Each checkpoint has only these types of files:
-rw-r--r-- 1 user users 28 Feb 5 10:00 model.ckpt-43254.data-00000-of-00002
-rw-r--r-- 1 user users 321M Feb 5 10:00 model.ckpt-43254.data-00001-of-00002
-rw-r--r-- 1 user users 825 Feb 5 10:00 model.ckpt-43254.index
-rw-r--r-- 1 user users 1.2M Feb 5 10:00 model.ckpt-43254.meta
No model checkpoint whatsoever. This is what my config looks like for saving:
train:
  batch_size: 64
  bucket_width: 1
  save_checkpoints_steps: 5000
  save_summary_steps: 30
  train_steps: 80000
  maximum_features_length: 50
  maximum_labels_length: 50
  sample_buffer_size: 1000000 # Consider setting this to the training dataset size.
  keep_checkpoint_max: 10
  clip_gradients: 5.0
It seems it worked as it should. Does the file …/models/tensorflow/baseline/newstest2016.bpe.tr.translated contain your translations?
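For context on the listings above: a TensorFlow checkpoint is not a single file. `model.ckpt-43254` is a prefix, and the `.index` file together with all `.data-NNNNN-of-MMMMM` shards holds the variables, while `.meta` stores the graph. So the checkpoint is restored by passing the prefix, not one of the individual files. Below is a minimal sketch, assuming only this standard file naming, that checks which prefixes in a directory listing are complete. `find_checkpoints` is a hypothetical helper written for illustration; it is not part of OpenNMT-tf or TensorFlow.

```python
import re
from collections import defaultdict

# Matches data shards such as "model.ckpt-43254.data-00001-of-00002".
_DATA_RE = re.compile(r"^(?P<prefix>.+)\.data-(?P<shard>\d{5})-of-(?P<total>\d{5})$")


def find_checkpoints(filenames):
    """Map each checkpoint prefix to True if its .index and all data shards exist."""
    has_index = set()
    shards = defaultdict(set)
    totals = {}
    for name in filenames:
        if name.endswith(".index"):
            has_index.add(name[: -len(".index")])
            continue
        match = _DATA_RE.match(name)
        if match:
            prefix = match.group("prefix")
            shards[prefix].add(int(match.group("shard")))
            totals[prefix] = int(match.group("total"))
    result = {}
    for prefix in sorted(has_index | set(shards)):
        total = totals.get(prefix, 0)
        # Complete means: an index file plus every shard from 0 to total - 1.
        result[prefix] = (
            prefix in has_index
            and total > 0
            and shards[prefix] == set(range(total))
        )
    return result


if __name__ == "__main__":
    # File names copied from the listing in the question.
    listing = [
        "model.ckpt-43254.data-00000-of-00002",
        "model.ckpt-43254.data-00001-of-00002",
        "model.ckpt-43254.index",
        "model.ckpt-43254.meta",
    ]
    print(find_checkpoints(listing))  # {'model.ckpt-43254': True}
```

To run it on a real directory, pass `os.listdir(model_dir)` as the argument. A prefix reported as `True` is the value to use wherever a checkpoint path is expected.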
This message:
2018-02-05 20:01:06.120897: W tensorflow/core/framework/op_kernel.cc:1192] Out of range: End of sequence
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,?], [?]], output_types=[DT_INT64, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
is actually not an error and means the iteration over the input file finished. I think it is now silenced in TensorFlow 1.5.0 (?).