Unified decoding does not support TensorFlow 1.4

tel34 · November 13, 2019, 8:59pm

I’m running OpenNMT-tf V1 in a virtual environment with Python 2.7. Training proceeds OK, but after around 500 steps /opennmt/utils/decoding.py raises a NotImplementedError (line 339) and tells me “Unified decoding does not support Tensorflow 1.4”.
To get around this, should I upgrade to the latest (and last) TensorFlow 1.x release, TensorFlow 1.15, or is there a lower version of TensorFlow that supports unified decoding?

guillaumekln · November 13, 2019, 9:31pm

TensorFlow 1.5 should be the oldest working version but I recommend installing a more recent one if possible.
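
For anyone who wants to check their environment before starting a long training run, here is a minimal sketch of a version comparison. The helper name `meets_minimum_version` is made up for this example and is not part of OpenNMT-tf; it only checks the lower bound mentioned above and says nothing about whether a given release is otherwise compatible.

```python
def meets_minimum_version(version_string, minimum=(1, 5)):
    """Return True if a 'major.minor[.patch]' version string is >= minimum.

    Hypothetical helper for illustration; only major and minor are compared.
    """
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) >= minimum


# In a real environment the string would come from the installed package:
#   import tensorflow as tf
#   meets_minimum_version(tf.__version__)
print(meets_minimum_version("1.4.0"))
print(meets_minimum_version("1.14.0"))
```

Comparing integer tuples rather than raw strings matters here, because a plain string comparison would rank "1.14" below "1.5".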

tel34 · November 14, 2019, 5:09pm

Well, I created a new virtual environment for Python 2.7 and installed TensorFlow 1.14 and OpenNMT-tf==1.2.52. Training started and before long I got an OOM with the messages below. The batch_size is 1024 and the batch_type is tokens. Last week I trained a model with OpenNMT-tf V2 on the same machine with the same batch size and type without problems. Does anyone know whether these versions of OpenNMT-tf and TensorFlow actually require Python 3.5?
Extracts from messages:
2019-11-14 15:19:42.972168: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 9.78GiB
2019-11-14 15:19:42.972184: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 10978662656 memory_limit_: 10978662810 available bytes: 154 curr_region_allocation_bytes_: 21957325824
2019-11-14 15:19:42.972202: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit: 10978662810
InUse: 10501994240
MaxInUse: 10501994240
NumAllocs: 13043832
MaxAllocSize: 3319452160

(1) Resource exhausted: OOM when allocating tensor with shape[30,8,1664,1664] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node transformer/encoder_1/layer_0/multi_head/Softmax (defined at /local/lib/python2.7/site-packages/opennmt/layers/transformer.py:198) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node transformer/encoder_1/layer_0/multi_head/Softmax:
transformer/encoder_1/layer_0/multi_head/add (defined at /local/lib/python2.7/site-packages/opennmt/layers/transformer.py:195)

Original stack trace for u'transformer/encoder_1/layer_0/multi_head/Softmax':
File "/bin/onmt-main", line 8, in <module>
sys.exit(main())
File "/local/lib/python2.7/site-packages/opennmt/bin/main.py", line 172, in main
runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
File "/local/lib/python2.7/site-packages/opennmt/runner.py", line 301, in train_and_evaluate
result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train

guillaumekln · November 14, 2019, 5:12pm

Does it happen during evaluation?

The tensor shape [30,8,1664,1664] in the OOM message indicates a sentence with 1664 tokens.
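
To give a rough idea of why a single long sentence is enough to exhaust the GPU, here is a back-of-the-envelope sketch. It assumes the shape is laid out as [batch, heads, length, length] and that "type float" means 4 bytes per element; the function name is just for illustration.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape (float32 assumed)."""
    size = bytes_per_element
    for dim in shape:
        size *= dim
    return size


shape = [30, 8, 1664, 1664]  # shape reported in the OOM message
size = tensor_bytes(shape)
print("%d bytes (about %.2f GiB)" % (size, size / float(1024 ** 3)))
```

That is roughly a quarter of the ~10.2 GiB memory limit shown in the log for one attention tensor in one layer, and the cost grows with the square of the sentence length.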

tel34 · November 14, 2019, 7:38pm

I am now getting a tensor with shape [30,8,1119,1119] before training crashes. The evaluation batch size is 32 and I have even added batch_type: tokens to the config file, so I don’t know how it gets 1119 tokens.
Many of the error messages preceding the crash refer to evaluation. I am puzzled here, as I have trained a good many OpenNMT-tf models without these issues.

guillaumekln · November 14, 2019, 7:59pm

Maybe you should clean up your evaluation file to remove sentences that are too long.
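
A minimal sketch of such a clean-up, assuming parallel, already tokenized source and target files with one sentence per line and tokens separated by spaces. The function names and the 200-token threshold are placeholders for this example, not OpenNMT-tf options.

```python
import io


def filter_long_pairs(src_lines, tgt_lines, max_length=200):
    """Keep only the pairs where both sides have at most max_length tokens."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) <= max_length and len(tgt.split()) <= max_length:
            kept.append((src, tgt))
    return kept


def read_lines(path):
    """Read a UTF-8 text file into a list of lines without trailing newlines."""
    with io.open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def filter_files(src_path, tgt_path, out_src_path, out_tgt_path, max_length=200):
    """Write filtered copies of a parallel corpus; returns the number of pairs kept."""
    pairs = filter_long_pairs(read_lines(src_path), read_lines(tgt_path), max_length)
    with io.open(out_src_path, "w", encoding="utf-8") as out_src, \
            io.open(out_tgt_path, "w", encoding="utf-8") as out_tgt:
        for src, tgt in pairs:
            out_src.write(src + u"\n")
            out_tgt.write(tgt + u"\n")
    return len(pairs)
```

The source and target files are filtered together so that the line alignment between them is preserved. The length should be counted on the same tokenization that the model sees, since subword segmentation can make a sentence much longer than its word count suggests.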

tel34 · November 14, 2019, 8:15pm

Yes, that may be the culprit, particularly as I have used SentencePiece (with BPE). I’ll take things back a stage tomorrow.

tel34 · November 17, 2019, 12:33pm

It turned out that the initial clean-up of the evaluation files hadn’t happened. Now training works great.