Settings for training TransformerBig with mixed precision on single GPU?

I’m trying to start a TransformerBig mixed precision training on a single RTX 2080 Ti (11GB), but it fails to start with OOM errors. I have changed the batch size and type, so the train section of my config now looks like this:

train:
     keep_checkpoint_max: 8
     average_last_checkpoints: 5
     max_step: 200000
     batch_type: tokens
     batch_size: 2048

How should I tune the config? Thanks in advance.

Hi Panos,

This isn’t an answer to your question :-). I have been training Transformer models on a single RTX 2080 Ti, keeping my batch size down to 1024 tokens to fit everything in.
I am curious what you want to accomplish with TransformerBig. I am about to start a new training and would switch to that if there are clear advantages in terms of quality of output.
Cheers,
Terence

Hi Terence,

Most papers (including Attention Is All You Need) report a clear BLEU improvement with larger Transformer models. I’d like to test this myself and weigh the improvement against the training cost (which should be much higher).

Does the same training work without mixed precision?

I was using OpenNMT-tf v2.7. After updating to version 2.8, I can start a mixed precision training for a single step, but then I get the OOM error. Without mixed precision it ran for a few steps without problems, but the speed is very slow (0.26 steps/s, ~6,500 words/s).

How large is your vocabulary?

It contains 31991 units.

It’s possible there is higher than necessary memory overhead when training a mixed precision model. I will try to review that in the coming days.

I already pushed a small optimization to TensorFlow core:

Thanks for your work @guillaumekln. I see this is already merged into master, so should we expect it to be included in a minor version update?

This will only land in TensorFlow 2.3, which is a few months away, but we could temporarily include it in OpenNMT-tf if needed. Tomorrow I will run some tests on compatible hardware to see how this change (and possibly others) could help.

OK, great!

I just tested on a V100 and did not find major issues with the current OpenNMT-tf version. I could train with a batch size of 4096 with good performance.

How low do you need to set the batch size for the training to run? Are you using auto_config?

You can also try training with shared embeddings to reduce the memory usage.
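
For example, instead of --model_type TransformerBig, you can pass a custom model definition file to onmt-main with --model. The following is only a minimal sketch, assuming OpenNMT-tf v2’s opennmt.models.Transformer constructor, TransformerBig-like dimensions, and a single vocabulary file shared between source and target (the file name is hypothetical):

    # shared_big.py - hypothetical custom model definition.
    # Run with: onmt-main --model shared_big.py --config data.yml --auto_config train
    import opennmt

    def model():
        # TransformerBig-like dimensions, with input and softmax embeddings shared.
        # Sharing requires source_vocabulary and target_vocabulary in the data
        # config to point to the same vocabulary file.
        return opennmt.models.Transformer(
            source_inputter=opennmt.inputters.WordEmbedder(embedding_size=1024),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=1024),
            num_layers=6,
            num_units=1024,
            num_heads=16,
            ffn_inner_dim=4096,
            dropout=0.1,
            attention_dropout=0.1,
            ffn_dropout=0.1,
            share_embeddings=opennmt.models.EmbeddingsSharingLevel.ALL,
        )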

Hello,

I’m a newcomer to NMT with a translation studies background, so please go easy on me with regard to IT knowledge.

I’ve run into an issue similar to the thread starter’s, which suggests there might indeed be a bit of extra memory overhead for mixed precision training.

I recently completed training up to 70,000 steps on my JP-EN patent data in full precision, but wanted to see if I could speed it up with mixed precision on my RTX 2060.

My vocabulary is 50,000 tokens for both source and target. The training corpus is around 20+ million words on each side.

When training in full precision with the following settings, I get no OOM messages:

onmt-main --model_type Transformer --config ntc7o/datatransJPEN.yml config/transformer.yml --auto_config --gpu_allow_growth train --with_eval

config/transformer.yml
train:
  batch_size: 1024
  effective_batch_size: 4096
  save_checkpoint_steps: 10000
  keep_checkpoint_max: 5
  seed: 3435
  train_steps: 500000
  valid_steps: 10000
  warmup_steps: 8000
  report_every: 100

eval:
  batch_size: 32
  steps: 10000
  save_eval_predictions: true
  external_evaluators: bleu
  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 4

However, when training with the same settings, the only exceptions being the vocabulary built with --size_multiple 8 and the added --mixed_precision flag, this is what I get:

INFO:tensorflow:Number of model parameters: 121002840
2020-03-23 11:01:30.668202: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
2020-03-23 11:01:39.317652: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1857] Converted 2370/10505 nodes to float16 precision using 197 cast(s) to float16 (excluding Const and Variable casts)
2020-03-23 11:01:42.689462: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-03-23 11:01:45.873516: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-03-23 11:01:53.525157: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
INFO:tensorflow:Saved checkpoint /home/chris/NMT/openNMT-tf/ntc7o/transmixedJPEN/ckpt-1
2020-03-23 11:01:58.390385: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 1.07G (1153105920 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-23 11:01:58.390899: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 989.72M (1037795328 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-03-23 11:01:58.391486: I tensorflow/stream_executor/cuda/cuda_driver.cc:801] failed to allocate 890.75M (934015744 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2020-03-23 11:02:44.138455: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1351] No whitelist ops found, nothing to do
INFO:tensorflow:Step = 100 ; steps/s = 0.85, source words/s = 3835, target words/s = 3572 ; Learning rate = 0.000012 ; Loss = 10.206561
INFO:tensorflow:Step = 200 ; steps/s = 2.10, source words/s = 9687, target words/s = 8974 ; Learning rate = 0.000025 ; Loss = 9.001592
INFO:tensorflow:Step = 300 ; steps/s = 2.11, source words/s = 9667, target words/s = 8942 ; Learning rate = 0.000037 ; Loss = 7.650735

Training still starts and is indeed a good bit faster than without mixed_precision (2.10 steps/s vs. 1.70 steps/s), but the OOM errors are worrisome. I’ll report whether the results are as expected, but any ideas on how to avoid them?
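
For clarity, the mixed precision run is the same onmt-main command as above with one extra flag, plus the vocabularies rebuilt with --size_multiple 8 beforehand. Roughly like this (a sketch; the onmt-build-vocab lines and file names are illustrative rather than my exact paths):

    onmt-build-vocab --size 50000 --size_multiple 8 --save_vocab src_vocab.txt train.ja
    onmt-build-vocab --size 50000 --size_multiple 8 --save_vocab tgt_vocab.txt train.en
    onmt-main --model_type Transformer --config ntc7o/datatransJPEN.yml config/transformer.yml \
        --auto_config --gpu_allow_growth --mixed_precision train --with_eval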

P.S.: Note the --gpu_allow_growth flag. This is a different issue altogether, but without this flag, training will not start at all. It appears to be an issue with the RTX series?

Update 24_03_2020: I reached my early stopping metric early this evening. Everything seems to have worked despite the OOM messages. Are they safe to ignore?

Looks like you can ignore them if the training is still running.


In 2.8.1, I backported the small optimization we talked about earlier. It’s not life-changing, but it can possibly save a few hundred megabytes.
You could also try setting batch_size to 0. The training will first run a binary search to find a suitable batch size value.
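
For example, in the train section of the config:

    train:
      batch_type: tokens
      batch_size: 0   # 0 = search automatically for a suitable batch size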

I’ll try to run some more tests in a few days and report back.

Thank you! I will test and see if the optimization alone solves the OOM issues and also test the batch_size 0 setting.

I’ll keep you posted.

Edit:
While the optimization did not resolve the OOM errors, training appears to run a bit faster at the same batch_size of 1024 (effective 4096), going from 2.20 steps/s to 2.27 steps/s.

Using the auto-tuning batch_size of 0 results in a batch size of 819, which is quite a bit less efficient since it needs an extra gradient accumulation step to reach the effective 4096 (1.77 steps/s)… Weirdly enough, I still get the OOM errors when training with mixed precision, even at a batch size of 819.

As a side note: not using --gpu_allow_growth still results in the following error and an aborted training:

2020-03-26 15:37:48.574708: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-03-26 15:37:48.574743: W ./tensorflow/stream_executor/stream.h:2039] attempting to perform DNN operation using StreamExecutor without DNN support
2020-03-26 15:37:48.574787: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Internal: cuDNN launch failure : input shape ([1,840,512,1])
[[{{node transformer_base/self_attention_encoder/self_attention_encoder_layer/transformer_layer_wrapper/layer_norm_1/FusedBatchNormV3}}]]

This is likely a TF bug: https://github.com/tensorflow/tensorflow/issues/24496
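
A common workaround for that cuDNN error on RTX cards is enabling GPU memory growth, which is what the --gpu_allow_growth flag requests. As a minimal standalone TensorFlow sketch of the same setting (not OpenNMT-tf’s exact code):

    import tensorflow as tf

    # Allocate GPU memory on demand instead of reserving nearly all of it upfront.
    # This must run before the GPU is initialized by any op.
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)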

I have successfully started a TransformerBig training with mixed precision and the latest version of OpenNMT-tf (2.8.1). I had to drop the batch size to 1024 to do so. Speed is acceptable at 0.38 steps/s (~11k words/s).

I can confirm that the batch size is auto-tuned just fine when it’s set to 0. With this setting, it increased to 1586.