I tried training the default transformer model on FloydLab. When the train-and-evaluate loop reaches evaluation, the process stalls at “INFO:tensorflow:Restoring parameters from /output/model.ckpt-…”. GPU utilization drops to 0 and memory usage (about 4% of 60 GB during training) climbs steadily to 100%, after which the system invariably crashes with std::bad_alloc. With another model, for example the default character-level seq2seq one, there is no problem. Has anyone run into this issue before? Thanks a lot!
2018-03-28 03:54:22 PST 2018-03-28 10:54:22.920203: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-28 03:54:22 PST 2018-03-28 10:54:22.920582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
2018-03-28 03:54:22 PST name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8755
2018-03-28 03:54:22 PST pciBusID: 0000:00:1e.0
2018-03-28 03:54:22 PST totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-03-28 03:54:22 PST 2018-03-28 10:54:22.920623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-03-28 03:54:23 PST 2018-03-28 10:54:23.967381: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-03-28 03:54:23 PST INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_session_config': gpu_options {
2018-03-28 03:54:23 PST }
2018-03-28 03:54:23 PST allow_soft_placement: true
2018-03-28 03:54:23 PST , '_keep_checkpoint_max': 8, '_task_type': 'worker', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8ecaaba910>, '_save_checkpoints_steps': 1000, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 1000, '_model_dir': '/output', '_save_summary_steps': 1000}
2018-03-28 03:54:23 PST INFO:tensorflow:Running training and evaluation locally (non-distributed).
2018-03-28 03:54:23 PST INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 1200 secs (eval_spec.throttle_secs) or training is finished.
2018-03-28 03:54:33 PST INFO:tensorflow:Create CheckpointSaverHook.
2018-03-28 03:54:35 PST INFO:tensorflow:Number of trainable parameters: 28051736
2018-03-28 03:54:37 PST 2018-03-28 10:54:37.493486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-03-28 03:54:53 PST INFO:tensorflow:Saving checkpoints for 1 into /output/model.ckpt.
2018-03-28 03:54:56 PST INFO:tensorflow:loss = 4.700724, step = 1
2018-03-28 03:55:44 PST INFO:tensorflow:loss = 2.454182, step = 101 (48.396 sec)
2018-03-28 03:56:31 PST INFO:tensorflow:loss = 1.9354556, step = 201 (46.845 sec)
2018-03-28 03:57:18 PST INFO:tensorflow:loss = 2.7831678, step = 301 (47.024 sec)
2018-03-28 03:58:05 PST INFO:tensorflow:loss = 1.803523, step = 401 (47.142 sec)
2018-03-28 03:58:52 PST INFO:tensorflow:loss = 1.8611735, step = 501 (47.221 sec)
2018-03-28 03:59:40 PST INFO:tensorflow:loss = 1.6110716, step = 601 (47.260 sec)
2018-03-28 04:00:27 PST INFO:tensorflow:loss = 1.4854397, step = 701 (47.335 sec)
2018-03-28 04:01:14 PST INFO:tensorflow:loss = 1.5282784, step = 801 (47.322 sec)
2018-03-28 04:02:02 PST INFO:tensorflow:loss = 1.4539702, step = 901 (47.345 sec)
2018-03-28 04:02:49 PST INFO:tensorflow:Saving checkpoints for 1001 into /output/model.ckpt.
2018-03-28 04:02:51 PST INFO:tensorflow:global_step/sec: 2.10535
2018-03-28 04:02:51 PST INFO:tensorflow:loss = 1.5363147, step = 1001 (49.093 sec)
2018-03-28 04:03:38 PST INFO:tensorflow:loss = 1.6296451, step = 1101 (47.300 sec)
2018-03-28 04:04:25 PST INFO:tensorflow:loss = 1.2548004, step = 1201 (47.336 sec)
2018-03-28 04:05:13 PST INFO:tensorflow:loss = 1.4279612, step = 1301 (47.241 sec)
2018-03-28 04:06:00 PST INFO:tensorflow:loss = 1.5853598, step = 1401 (47.203 sec)
2018-03-28 04:06:43 PST INFO:tensorflow:loss = 2.689854, step = 1501 (42.929 sec)
2018-03-28 04:07:30 PST INFO:tensorflow:loss = 2.118863, step = 1601 (47.162 sec)
2018-03-28 04:08:17 PST INFO:tensorflow:loss = 2.3222811, step = 1701 (47.139 sec)
2018-03-28 04:09:04 PST INFO:tensorflow:loss = 2.3114202, step = 1801 (47.109 sec)
2018-03-28 04:09:51 PST INFO:tensorflow:loss = 2.4362457, step = 1901 (47.105 sec)
2018-03-28 04:10:38 PST INFO:tensorflow:Saving checkpoints for 2001 into /output/model.ckpt.
2018-03-28 04:10:40 PST INFO:tensorflow:global_step/sec: 2.12918
2018-03-28 04:10:40 PST INFO:tensorflow:words_per_sec/features: 6133.62
2018-03-28 04:10:40 PST INFO:tensorflow:words_per_sec/labels: 6131.24
2018-03-28 04:10:40 PST INFO:tensorflow:loss = 2.2236152, step = 2001 (49.143 sec)
2018-03-28 04:11:28 PST INFO:tensorflow:loss = 2.5957835, step = 2101 (47.052 sec)
2018-03-28 04:12:15 PST INFO:tensorflow:loss = 2.2570279, step = 2201 (47.063 sec)
2018-03-28 04:13:02 PST INFO:tensorflow:loss = 2.5881119, step = 2301 (46.970 sec)
2018-03-28 04:13:49 PST INFO:tensorflow:loss = 2.204818, step = 2401 (47.005 sec)
2018-03-28 04:14:36 PST INFO:tensorflow:loss = 2.4727502, step = 2501 (46.961 sec)
2018-03-28 04:14:36 PST INFO:tensorflow:Saving checkpoints for 2501 into /output/model.ckpt.
2018-03-28 04:14:38 PST INFO:tensorflow:Loss for final step: 2.4727502.
2018-03-28 04:14:43 PST INFO:tensorflow:Starting evaluation at 2018-03-28-11:14:43
2018-03-28 04:14:43 PST 2018-03-28 11:14:43.566779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-03-28 04:14:43 PST INFO:tensorflow:Restoring parameters from /output/model.ckpt-2501
2018-03-28 04:23:28 PST terminate called after throwing an instance of 'std::bad_alloc'
2018-03-28 04:23:28 PST what(): std::bad_alloc
2018-03-28 04:23:31 PST Aborted (core dumped)
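
In case it helps narrow this down, here is a minimal sketch of how the process's memory could be logged next to the TensorFlow output, so the growth between "Restoring parameters" and the std::bad_alloc abort can be timed. It assumes Python 3 on a Unix-like host and is not part of the training code: the helper names peak_rss_mib and start_memory_logger and the 30-second interval are made up for illustration, and only the standard library is used.

```python
import resource
import sys
import threading
import time


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def start_memory_logger(interval_secs=30.0):
    """Print peak RSS every interval_secs from a daemon thread.

    Returns a threading.Event; calling .set() on it stops the logger.
    The function name and default interval are illustrative assumptions.
    """
    stop = threading.Event()

    def _run():
        # Event.wait returns False on timeout, so this loops until stop is set.
        while not stop.wait(interval_secs):
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print("%s peak RSS: %.1f MiB" % (stamp, peak_rss_mib()), flush=True)

    threading.Thread(target=_run, daemon=True).start()
    return stop
```

The idea would be to call start_memory_logger() once before the train-and-evaluate loop starts. Because ru_maxrss is a peak value it never decreases, so a figure that keeps rising after the restore message would point at host memory being consumed during evaluation rather than at the GPU.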