I’m running OpenNMT-tf V1 in a virtual environment with Python 2.7. Training proceeds OK, but after around 500 steps /opennmt/utils/decoding.py raises a NotImplementedError (line 339) telling me “Unified decoding does not support Tensorflow 1.4”.
To get around this, should I upgrade to the very latest (and last) TensorFlow 1.x release, which is TensorFlow 1.15, or which lower version of TensorFlow would support unified decoding?
TensorFlow 1.5 should be the oldest working version but I recommend installing a more recent one if possible.
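It may also be worth confirming which TensorFlow version the virtual environment actually picks up, since the error message suggests it is seeing 1.4. A quick sanity check from inside the venv (nothing OpenNMT-specific):

# Confirm which TensorFlow this virtual environment resolves.
import tensorflow as tf
print(tf.__version__)  # anything below 1.5 would explain the unified decoding error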
Well, I created a new virtual environment for Python 2.7 and installed TensorFlow 1.14 and OpenNMT-tf==1.2.52. Training started, and before long I got an OOM error with the messages below. The batch_size is 1024 and the batch_type is tokens. Last week I trained a model with OpenNMT-tf V2 on the same machine with the same batch size and type without problems. Does anyone know whether these versions of OpenNMT-tf and TensorFlow actually require Python 3.5?
Extracts from messages:
2019-11-14 15:19:42.972168: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 9.78GiB
2019-11-14 15:19:42.972184: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 10978662656 memory_limit_: 10978662810 available bytes: 154 curr_region_allocation_bytes_: 21957325824
2019-11-14 15:19:42.972202: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats:
Limit: 10978662810
InUse: 10501994240
MaxInUse: 10501994240
NumAllocs: 13043832
MaxAllocSize: 3319452160
(1) Resource exhausted: OOM when allocating tensor with shape[30,8,1664,1664] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node transformer/encoder_1/layer_0/multi_head/Softmax (defined at /local/lib/python2.7/site-packages/opennmt/layers/transformer.py:198) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Errors may have originated from an input operation.
Input Source operations connected to node transformer/encoder_1/layer_0/multi_head/Softmax:
transformer/encoder_1/layer_0/multi_head/add (defined at /local/lib/python2.7/site-packages/opennmt/layers/transformer.py:195)
Original stack trace for u'transformer/encoder_1/layer_0/multi_head/Softmax':
File "/bin/onmt-main", line 8, in <module>
    sys.exit(main())
File "/local/lib/python2.7/site-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
File "/local/lib/python2.7/site-packages/opennmt/runner.py", line 301, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
File "/local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
File "/local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
File "/local/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
Does it happen during evaluation? The shape [30,8,1664,1664] indicates a sentence with 1664 tokens.
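For a rough sense of scale (just a back-of-the-envelope calculation on my side), that tensor is the attention softmax with shape [batch, heads, length, length], so a batch of 30 sentences padded to 1664 tokens already needs a few GiB for that single activation, and there is one of these per attention layer:

# Rough memory estimate for the reported attention softmax tensor:
# shape [batch, heads, length, length], float32 = 4 bytes per element.
batch, heads, length = 30, 8, 1664
num_bytes = batch * heads * length * length * 4
print("%.2f GiB" % (num_bytes / (1024.0 ** 3)))  # about 2.5 GiB for this one tensor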
I am now getting a tensor with shape [30,8,1119,1119] before training crashes. The evaluation batch size is 32 and I have even added batch_type: tokens to the config file, so I don’t know how it gets to 1119 tokens.
Many of the error messages preceding the crash refer to evaluation. I am puzzled here, as I have trained a good many OpenNMT-tf models without these issues.
Maybe you should clean up your evaluation files and remove the sentences that are too long.
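Something like this minimal sketch is what I mean, assuming plain-text, line-aligned source/target evaluation files that are already tokenized the same way they are passed to OpenNMT-tf (the file names and the 200-token limit are just placeholders to adapt):

# Drop evaluation sentence pairs whose whitespace-tokenized length exceeds a limit.
# File names and max_len are placeholders, not taken from this thread.
max_len = 200
with open("eval.src") as src_in, open("eval.tgt") as tgt_in, \
     open("eval.filtered.src", "w") as src_out, open("eval.filtered.tgt", "w") as tgt_out:
    for src_line, tgt_line in zip(src_in, tgt_in):
        if len(src_line.split()) <= max_len and len(tgt_line.split()) <= max_len:
            src_out.write(src_line)
            tgt_out.write(tgt_line)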
Yes, that may be the culprit particularly as I have used SentencePiece (with BPE). I’ll take things back a stage tomorrow.
It turned out that the initial clean-up of the evaluation files had never actually been done. After removing the overly long sentences, training now works great.