Hello,
I’m a newcomer to NMT with a translation studies background, so please go easy on me when it comes to expectations about my IT knowledge.
I’ve run into a similar issue to the one the thread starter describes, which suggests there might indeed be some extra memory overhead with mixed precision training.
I recently completed training to 70,000 steps on my JP-EN patent data in full precision, but wanted to see whether I could speed things up with mixed precision on my RTX 2060.
My vocabulary is 50,000 tokens for both source and target, and the training corpus is around 20+ million words on each side.
When training in full precision with the following settings, I get no OOM messages:
onmt-main --model_type Transformer --config ntc7o/datatransJPEN.yml config/transformer.yml --auto_config --gpu_allow_growth train --with_eval
config/transformer.yml
train:
  batch_size: 1024
  effective_batch_size: 4096
  save_checkpoint_steps: 10000
  keep_checkpoint_max: 5
  seed: 3435
  train_steps: 500000
  valid_steps: 10000
  warmup_steps: 8000
  report_every: 100

eval:
  batch_size: 32
  steps: 10000
  save_eval_predictions: true
  external_evaluators: bleu
  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 4
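For completeness, ntc7o/datatransJPEN.yml is just the usual data section pointing at the corpora and vocabularies; the paths below are shortened stand-ins for my actual ones:

data:
  train_features_file: ntc7o/train.jp
  train_labels_file: ntc7o/train.en
  eval_features_file: ntc7o/dev.jp
  eval_labels_file: ntc7o/dev.en
  source_vocabulary: ntc7o/vocabJP.txt
  target_vocabulary: ntc7o/vocabEN.txt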
However, when I train with the same settings, the only changes being a vocabulary built with --size_multiple 8 and the --mixed_precision flag (roughly the commands sketched below the log), this is what I get:
INFO:tensorflow:Number of model parameters: 121002840
2020-03-23 11:01:30.668202: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
2020-03-23 11:01:39.317652: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1857] Converted 2370/10505 nodes to float16 precision using 197 cast(s) to float16 (excluding Const and Variable casts)
2020-03-23 11:01:42.689462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-23 11:01:45.873516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-23 11:01:53.525157: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
INFO:tensorflow:Saved checkpoint /home/chris/NMT/openNMT-tf/ntc7o/transmixedJPEN/ckpt-1
2020-03-23 11:01:58.390385: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 1.07G (1153105920 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-23 11:01:58.390899: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 989.72M (1037795328 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-23 11:01:58.391486: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 890.75M (934015744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2020-03-23 11:02:44.138455: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
INFO:tensorflow:Step = 100 ; steps/s = 0.85, source words/s = 3835, target words/s = 3572 ; Learning rate = 0.000012 ; Loss = 10.206561
INFO:tensorflow:Step = 200 ; steps/s = 2.10, source words/s = 9687, target words/s = 8974 ; Learning rate = 0.000025 ; Loss = 9.001592
INFO:tensorflow:Step = 300 ; steps/s = 2.11, source words/s = 9667, target words/s = 8942 ; Learning rate = 0.000037 ; Loss = 7.650735
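For reference, the mixed-precision run was launched roughly as follows; the vocabulary build commands are from memory and the corpus/vocabulary file names are approximations of my actual paths:

onmt-build-vocab --size 50000 --size_multiple 8 --save_vocab ntc7o/vocabJP.txt ntc7o/train.jp
onmt-build-vocab --size 50000 --size_multiple 8 --save_vocab ntc7o/vocabEN.txt ntc7o/train.en

onmt-main --model_type Transformer --config ntc7o/datatransJPEN.yml config/transformer.yml --auto_config --gpu_allow_growth --mixed_precision train --with_eval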
Training still starts and is indeed a good bit faster than without --mixed_precision (2.10 steps/s vs. 1.70 steps/s), but the OOM messages are worrisome. I’ll report back on whether the results turn out as expected, but does anyone have ideas on how to avoid them?
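One idea I might try if the warnings turn out to matter (just a guess on my part, not something I have verified): lower batch_size in config/transformer.yml to leave more memory headroom, while keeping effective_batch_size so the gradients are simply accumulated over more sub-batches before each update, e.g.:

train:
  # halve the per-step batch to free memory; the update size stays at 4096 via accumulation
  batch_size: 512
  effective_batch_size: 4096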
P.S.: Note the --gpu_allow_growth flag. That is a different issue altogether, but without it, training will not start at all. It appears to be an issue with the RTX series?
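In case it helps anyone hitting the same thing: TensorFlow also exposes the same memory-growth behaviour through an environment variable, which I have not tried myself, so treat this as untested:

TF_FORCE_GPU_ALLOW_GROWTH=true onmt-main ... train --with_eval
(same arguments as above, minus --gpu_allow_growth)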
Update 24_03_2020: Training reached my early-stopping criterion early this evening. Everything seems to have worked despite the OOM messages. Are they safe to ignore?