Can't pass first evaluation - stuck at "restoring parameters"

I tried training the default Transformer model on FloydLab. When the train&eval loop reaches evaluation, the process stops at “INFO:tensorflow:Restoring parameters from /output/model.ckpt-…”. GPU utilization drops to 0 and memory usage (around 4% of 60 GB during training) progressively climbs to 100%, at which point the system invariably crashes. If I use another model, say the default character-level seq2seq one, there is no problem. Has anyone run into this issue before? Thanks a lot!
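For reference, the job is launched with roughly the following command (the model and configuration paths below are illustrative placeholders, not the exact invocation):

    onmt-main train_and_eval --model config/models/transformer.py --config config/my_config.yml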

2018-03-28 03:54:22 PST2018-03-28 10:54:22.920203: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-28 03:54:22 PST2018-03-28 10:54:22.920582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
2018-03-28 03:54:22 PSTname: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8755
2018-03-28 03:54:22 PSTpciBusID: 0000:00:1e.0
2018-03-28 03:54:22 PSTtotalMemory: 11.17GiB freeMemory: 11.10GiB
2018-03-28 03:54:22 PST2018-03-28 10:54:22.920623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-03-28 03:54:23 PST2018-03-28 10:54:23.967381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-03-28 03:54:23 PSTINFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_session_config': gpu_options {
2018-03-28 03:54:23 PST}
2018-03-28 03:54:23 PSTallow_soft_placement: true
2018-03-28 03:54:23 PST, '_keep_checkpoint_max': 8, '_task_type': 'worker', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8ecaaba910>, '_save_checkpoints_steps': 1000, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 1000, '_model_dir': '/output', '_save_summary_steps': 1000}
2018-03-28 03:54:23 PSTINFO:tensorflow:Running training and evaluation locally (non-distributed).
2018-03-28 03:54:23 PSTINFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 1200 secs (eval_spec.throttle_secs) or training is finished.
2018-03-28 03:54:33 PSTINFO:tensorflow:Create CheckpointSaverHook.
2018-03-28 03:54:35 PSTINFO:tensorflow:Number of trainable parameters: 28051736
2018-03-28 03:54:37 PST2018-03-28 10:54:37.493486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-03-28 03:54:53 PSTINFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
2018-03-28 03:54:56 PSTINFO:tensorflow:loss = 4.700724, step = 1
2018-03-28 03:55:44 PSTINFO:tensorflow:loss = 2.454182, step = 101 (48.396 sec)
2018-03-28 03:56:31 PSTINFO:tensorflow:loss = 1.9354556, step = 201 (46.845 sec)
2018-03-28 03:57:18 PSTINFO:tensorflow:loss = 2.7831678, step = 301 (47.024 sec)
2018-03-28 03:58:05 PSTINFO:tensorflow:loss = 1.803523, step = 401 (47.142 sec)
2018-03-28 03:58:52 PSTINFO:tensorflow:loss = 1.8611735, step = 501 (47.221 sec)
2018-03-28 03:59:40 PSTINFO:tensorflow:loss = 1.6110716, step = 601 (47.260 sec)
2018-03-28 04:00:27 PSTINFO:tensorflow:loss = 1.4854397, step = 701 (47.335 sec)
2018-03-28 04:01:14 PSTINFO:tensorflow:loss = 1.5282784, step = 801 (47.322 sec)
2018-03-28 04:02:02 PSTINFO:tensorflow:loss = 1.4539702, step = 901 (47.345 sec)
2018-03-28 04:02:49 PSTINFO:tensorflow:Saving checkpoints for 1001 into /output/model.ckpt.
2018-03-28 04:02:51 PSTINFO:tensorflow:global_step/sec: 2.10535
2018-03-28 04:02:51 PSTINFO:tensorflow:loss = 1.5363147, step = 1001 (49.093 sec)
2018-03-28 04:03:38 PSTINFO:tensorflow:loss = 1.6296451, step = 1101 (47.300 sec)
2018-03-28 04:04:25 PSTINFO:tensorflow:loss = 1.2548004, step = 1201 (47.336 sec)
2018-03-28 04:05:13 PSTINFO:tensorflow:loss = 1.4279612, step = 1301 (47.241 sec)
2018-03-28 04:06:00 PSTINFO:tensorflow:loss = 1.5853598, step = 1401 (47.203 sec)
2018-03-28 04:06:43 PSTINFO:tensorflow:loss = 2.689854, step = 1501 (42.929 sec)
2018-03-28 04:07:30 PSTINFO:tensorflow:loss = 2.118863, step = 1601 (47.162 sec)
2018-03-28 04:08:17 PSTINFO:tensorflow:loss = 2.3222811, step = 1701 (47.139 sec)
2018-03-28 04:09:04 PSTINFO:tensorflow:loss = 2.3114202, step = 1801 (47.109 sec)
2018-03-28 04:09:51 PSTINFO:tensorflow:loss = 2.4362457, step = 1901 (47.105 sec)
2018-03-28 04:10:38 PSTINFO:tensorflow:Saving checkpoints for 2001 into /output/model.ckpt.
2018-03-28 04:10:40 PSTINFO:tensorflow:global_step/sec: 2.12918
2018-03-28 04:10:40 PSTINFO:tensorflow:words_per_sec/features: 6133.62
2018-03-28 04:10:40 PSTINFO:tensorflow:words_per_sec/labels: 6131.24
2018-03-28 04:10:40 PSTINFO:tensorflow:loss = 2.2236152, step = 2001 (49.143 sec)
2018-03-28 04:11:28 PSTINFO:tensorflow:loss = 2.5957835, step = 2101 (47.052 sec)
2018-03-28 04:12:15 PSTINFO:tensorflow:loss = 2.2570279, step = 2201 (47.063 sec)
2018-03-28 04:13:02 PSTINFO:tensorflow:loss = 2.5881119, step = 2301 (46.970 sec)
2018-03-28 04:13:49 PSTINFO:tensorflow:loss = 2.204818, step = 2401 (47.005 sec)
2018-03-28 04:14:36 PSTINFO:tensorflow:loss = 2.4727502, step = 2501 (46.961 sec)
2018-03-28 04:14:36 PSTINFO:tensorflow:Saving checkpoints for 2501 into /output/model.ckpt.
2018-03-28 04:14:38 PSTINFO:tensorflow:Loss for final step: 2.4727502.
2018-03-28 04:14:43 PSTINFO:tensorflow:Starting evaluation at 2018-03-28-11:14:43
2018-03-28 04:14:43 PST2018-03-28 11:14:43.566779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-03-28 04:14:43 PSTINFO:tensorflow:Restoring parameters from /output/model.ckpt-2501
2018-03-28 04:23:28 PSTterminate called after throwing an instance of 'std::bad_alloc'
2018-03-28 04:23:28 PSTwhat(): std::bad_alloc
2018-03-28 04:23:31 PSTAborted (core dumped)

We had similar issues when training and evaluating with TensorFlow 1.5 (which seems to be the latest version available on FloydLab). If that works for you, a workaround is to use the train run type instead of train_and_eval (assuming you are on a recent version; I suggest using OpenNMT-tf v1.0.1). See the sketch below.
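A minimal sketch of that workaround, assuming a standard OpenNMT-tf v1 setup (file paths are placeholders, and the separate eval run type is only an option if it is available in your version):

    # Train only, skipping the in-process evaluation that hangs:
    onmt-main train --model config/models/transformer.py --config config/my_config.yml

    # Optionally evaluate the saved checkpoints in a separate run afterwards:
    onmt-main eval --model config/models/transformer.py --config config/my_config.yml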

@vince62s Did you solve this issue at some point when using TensorFlow 1.5?

Thanks for the blazing fast answer, Guillaume. I’m using the latest version, so I can try with train only.