Settings for training TransformerBig with mixed precision on single GPU?

I’m trying to start a TransformerBig mixed precision training on a single RTX 2080 Ti (11GB), but it fails to start with OOM errors. I have changed the batch size and type, so the train section of my config now looks like this:

train:
     keep_checkpoint_max: 8
     average_last_checkpoints: 5
     max_step: 200000
     batch_type: tokens
     batch_size: 2048

How should I tune the config? Thanks in advance.

Hi Panos,

This isn’t an answer to your question :-). I have been training Transformer models on a single RTX 2080 Ti, keeping my batch size down to 1024 tokens to fit everything in.
I am curious what you want to accomplish with TransformerBig. I am about to start a new training and would switch to that if there are clear advantages in terms of quality of output.
Cheers,
Terence

Hi Terence,

Most papers (including Attention Is All You Need) report a clear BLEU improvement with larger Transformer models. I’d like to test this myself and weigh the improvement against the training cost (which should be much higher).

Does the same training work without mixed precision?

I was using OpenNMT-tf v2.7. After updating to version 2.8, I can start a mixed precision training for a single step, but then I get the OOM error. Without mixed precision it ran for a few steps without problems, but the speed is very slow (0.26 steps/s, ~6,500 words/s).

How large is your vocabulary?

It contains 31991 units.

It’s possible there is higher than necessary memory overhead when training a mixed precision model. I will try to review that in the coming days.

I already pushed a small optimization to TensorFlow core:

Thanks for your work @guillaumekln. I see this is already merged into master, so should we expect it to be included in a minor version update?

This will only land in TensorFlow 2.3, which is a few months away, but we could temporarily include it in OpenNMT-tf if needed. Tomorrow I will run some tests on compatible hardware to see how this change (and possibly others) could help.

OK, great!

I just tested on a V100 and did not find major issues with the current OpenNMT-tf version. I could train with a batch size of 4096 with good performance.

How low do you need to set the batch size for the training to run? Are you using auto_config?

You can also try training with shared embeddings to reduce the memory usage.
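
For example, instead of --model_type TransformerBig, you can pass a custom model definition file to onmt-main with --model. The following is only a minimal sketch, assuming OpenNMT-tf v2’s opennmt.models.Transformer constructor, TransformerBig-like dimensions, and a single vocabulary file shared between source and target (the file name is hypothetical):

    # shared_big.py - hypothetical custom model definition.
    # Run with: onmt-main --model shared_big.py --config data.yml --auto_config train
    import opennmt

    def model():
        # TransformerBig-like dimensions, with input and softmax embeddings shared.
        # Sharing requires source_vocabulary and target_vocabulary in the data
        # config to point to the same vocabulary file.
        return opennmt.models.Transformer(
            source_inputter=opennmt.inputters.WordEmbedder(embedding_size=1024),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=1024),
            num_layers=6,
            num_units=1024,
            num_heads=16,
            ffn_inner_dim=4096,
            dropout=0.1,
            attention_dropout=0.1,
            ffn_dropout=0.1,
            share_embeddings=opennmt.models.EmbeddingsSharingLevel.ALL,
        )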

Hello,

I’m a newcomer to NMT with a translation studies background, so please go easy on me with regard to IT knowledge.

I’ve run into an issue similar to the thread starter’s, which suggests there might indeed be a bit of extra memory overhead for mixed precision training.

I recently completed training up to 70,000 steps on my JP-EN patent data in full precision, but wanted to see if I could speed it up with mixed precision on my RTX 2060.

My vocabulary is 50,000 tokens for both source and target. The training corpus is around 20+ million words on each side.

When training in full precision with the following settings, I get no OOM messages:

onmt-main --model_type Transformer --config ntc7o/datatransJPEN.yml config/transformer.yml --auto_config --gpu_allow_growth train --with_eval

config/transformer.yml
train:
  batch_size: 1024
  effective_batch_size: 4096
  save_checkpoint_steps: 10000
  keep_checkpoint_max: 5
  seed: 3435
  train_steps: 500000
  valid_steps: 10000
  warmup_steps: 8000
  report_every: 100

eval:
  batch_size: 32
  steps: 10000
  save_eval_predictions: true
  external_evaluators: bleu
  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 4

However, when training with the same settings, the only exceptions being the vocabulary built with --size_multiple 8 and the added --mixed_precision flag, this is what I get:

INFO:tensorflow:Number of model parameters: 121002840
2020-03-23 11:01:30.668202: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
2020-03-23 11:01:39.317652: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1857] Converted 2370/10505 nodes to float16 precision using 197 cast(s) to float16 (excluding Const and Variable casts)
2020-03-23 11:01:42.689462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-23 11:01:45.873516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-23 11:01:53.525157: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
INFO:tensorflow:Saved checkpoint /home/chris/NMT/openNMT-tf/ntc7o/transmixedJPEN/ckpt-1
2020-03-23 11:01:58.390385: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 1.07G (1153105920 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-23 11:01:58.390899: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 989.72M (1037795328 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-23 11:01:58.391486: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 890.75M (934015744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2020-03-23 11:02:44.138455: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
INFO:tensorflow:Step = 100 ; steps/s = 0.85, source words/s = 3835, target words/s = 3572 ; Learning rate = 0.000012 ; Loss = 10.206561
INFO:tensorflow:Step = 200 ; steps/s = 2.10, source words/s = 9687, target words/s = 8974 ; Learning rate = 0.000025 ; Loss = 9.001592
INFO:tensorflow:Step = 300 ; steps/s = 2.11, source words/s = 9667, target words/s = 8942 ; Learning rate = 0.000037 ; Loss = 7.650735

Training still starts and is indeed a good bit faster than without mixed_precision (2.10 steps/s vs. 1.70 steps/s), but the OOM errors are worrisome. I’ll report whether the results are as expected, but any ideas on how to avoid them?
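
For clarity, the mixed precision run is the same onmt-main command as above with one extra flag, plus the vocabularies rebuilt with --size_multiple 8 beforehand. Roughly like this (a sketch; the onmt-build-vocab lines and file names are illustrative rather than my exact paths):

    onmt-build-vocab --size 50000 --size_multiple 8 --save_vocab src_vocab.txt train.ja
    onmt-build-vocab --size 50000 --size_multiple 8 --save_vocab tgt_vocab.txt train.en
    onmt-main --model_type Transformer --config ntc7o/datatransJPEN.yml config/transformer.yml \
        --auto_config --gpu_allow_growth --mixed_precision train --with_eval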

P.S.: Note the --gpu_allow_growth flag. This is a different issue altogether, but without this flag, training will not start at all. It appears to be an issue with the RTX series?

Update 24_03_2020: I reached my early stopping metric early this evening. Everything seems to have worked despite the OOM messages. Are they safe to ignore?

Looks like you can ignore them if the training is still running.


In 2.8.1, I backported the small optimization we talked about earlier. It’s not life-changing, but it can possibly save a few hundred megabytes.
You could also try setting batch_size to 0. The training will first run a binary search to find a suitable batch size value.
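
For example, in the train section of the config:

    train:
      batch_type: tokens
      batch_size: 0   # 0 = search automatically for a suitable batch size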

I’ll try to run some more tests in a few days and report back.

Thank you! I will test and see if the optimization alone solves the OOM issues and also test the batch_size 0 setting.

I’ll keep you posted.

Edit:
While the optimization did not resolve the OOM errors, training appears to run a bit faster at the same batch_size of 1024 (effective 4096), going from 2.20 steps/s to 2.27 steps/s.

Using the auto-tuning batch_size of 0 results in a batch size of 819, which is quite a bit less efficient since it needs an extra gradient accumulation step to reach the effective 4096 (1.77 steps/s)… Weirdly enough, I still get the OOM errors when training with mixed precision, even at a batch size of 819.

As a side note: not using --gpu_allow_growth still results in the following error and an aborted training:

2020-03-26 15:37:48.574708: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-26 15:37:48.574743: W ./tensorflow/stream_executor/stream.h:2039] attempting to perform DNN operation using StreamExecutor without DNN support
2020-03-26 15:37:48.574787: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: cuDNN launch failure : input shape ([1,840,512,1])
[[{{node transformer_base/self_attention_encoder/self_attention_encoder_layer/transformer_layer_wrapper/layer_norm_1/FusedBatchNormV3}}]]

This is likely a TF bug: https://github.com/tensorflow/tensorflow/issues/24496
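
A common workaround for that cuDNN error on RTX cards is enabling GPU memory growth, which is what the --gpu_allow_growth flag requests. As a minimal standalone TensorFlow sketch of the same setting (not OpenNMT-tf’s exact code):

    import tensorflow as tf

    # Allocate GPU memory on demand instead of reserving nearly all of it upfront.
    # This must run before the GPU is initialized by any op.
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)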

I have successfully started a TransformerBig training with mixed precision and the latest version of OpenNMT-tf (2.8.1). I had to drop the batch size to 1024 to do so. Speed is acceptable at 0.38 steps/s (~11k words/s).

I can confirm that the batch size is auto-tuned just fine when it’s set to 0. With this setting, it increased to 1586.