Thanks a lot for the hints, Guillaume.
Regarding the batch size, I usually use 4096 tokens (the same value as on my older machines). If I increase it, I get OOM errors; if I decrease it, throughput drops accordingly, so nothing out of the ordinary here, I think.
Interestingly, I get an error if I try to autotune the batch size. I thought this was related to my machine/configuration, so I didn’t create a GitHub issue, but let me know if it would be useful. The error is as follows:
INFO:tensorflow:... failed.
INFO:tensorflow:Trying training with batch size 4287...
INFO:tensorflow:... failed.
[...]
ERROR:tensorflow:Last training attempt exited with an error:
Traceback (most recent call last):
load_model_module
[...]
raise ValueError("Model configuration not found in %s" % path)
ValueError: Model configuration not found in [local_path]/run/model_description.py
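For context, my reading of the failure (an assumption on my side, not taken from the OpenNMT-tf source) is that each autotuning attempt restarts training and tries to reload the model definition from model_description.py in the run directory, which apparently does not exist in my case. A minimal sketch of the kind of check that would raise this error (the message is copied from the traceback; the rest is hypothetical):

```python
import os

def load_model_module(path):
    # Hypothetical sketch of the loader that fails above: it expects a model
    # definition at <run_dir>/model_description.py and raises if it is missing.
    # Only the error message is taken from the traceback; everything else is a guess.
    if not os.path.exists(path):
        raise ValueError("Model configuration not found in %s" % path)
    # The real loader would import the module and return the model definition here.
```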
Regarding mixed precision, I always have it enabled. When you shared your last benchmarks, I was able to reproduce the figures with and without mixed precision on my older machines (2 GPUs). On my new machines, mixed precision seems to work fine too. If I disable it, performance drops significantly (by roughly 40%). The related check seems OK:
INFO:tensorflow:Mixed precision compatibility check (mixed_float16): OK
Your GPUs will likely run quickly with dtype policy mixed_float16 as they all have compute capability of at least 7.0
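In case it is useful for comparison, this is how I cross-check the policy outside the trainer, using the standard TF 2.4 Keras mixed precision API (my assumption is that the trainer relies on the same global policy):

```python
import tensorflow as tf

# Standalone check of the mixed_float16 policy, independent of the trainer.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
policy = tf.keras.mixed_precision.global_policy()
print(policy.name, policy.compute_dtype, policy.variable_dtype)
# Expected: mixed_float16 float16 float32
```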
Regarding Horovod, my implementation/environment makes it a bit difficult to use, so I would prefer not to. I actually tried it out when this post was originally published, but I eventually discarded it, as performance was fine for me without it…
Regarding the training configuration, the trainer is run with these values:
trainer:
  architecture: TransformerBigSharedEmbeddings
  mixed_precision: true
  num_gpus: 4
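As a side note on num_gpus: my understanding (an assumption on my side, I have not checked the source) is that multi-GPU training without Horovod maps to a MirroredStrategy over the visible devices, which can be sanity-checked on its own:

```python
import tensorflow as tf

# Assumption: num_gpus: 4 corresponds to a MirroredStrategy over the 4 visible GPUs.
devices = ["/gpu:%d" % i for i in range(4)]
strategy = tf.distribute.MirroredStrategy(devices=devices)
print("Replicas in sync:", strategy.num_replicas_in_sync)  # expect 4 on the new machines
```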
And the config (which works fine with 2 GPUs):
train:
  average_last_checkpoints: 8
  batch_size: 4096
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 200000
  maximum_features_length: 256
  maximum_labels_length: 256
  mixed_precision: true
  moving_average_decay: 0.9999
  replace_unknown_target: true
  sample_buffer_size: 500000
  save_checkpoints_steps: 1000
  save_summary_steps: 200
  single_pass: false
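If it helps, this is how I read the interaction between batch_size, num_gpus, and effective_batch_size (an assumption about the gradient accumulation logic, not taken from the source): gradients are accumulated until the effective batch size is reached, so the number of accumulation steps differs between the 2-GPU and 4-GPU setups:

```python
import math

# Assumption: accum_steps is the smallest integer such that
# batch_size * num_gpus * accum_steps >= effective_batch_size.
batch_size = 4096             # tokens per replica and per step
effective_batch_size = 25000  # tokens per optimizer update

for num_gpus in (2, 4):
    accum_steps = math.ceil(effective_batch_size / (batch_size * num_gpus))
    print(num_gpus, "GPUs ->", accum_steps, "accumulation steps")
# 2 GPUs -> 4 accumulation steps
# 4 GPUs -> 2 accumulation steps
```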
Double-checking this config, I just noticed that “mixed_precision” is also set here, although I don’t know why. Anyway, it is passed to the trainer as well, so it should still take effect…
Note: I have also tried the latest TensorFlow patch release (2.4.4), but it made no difference.
Update: I had originally written that mixed precision was not being applied, but it was. I corrected this above too.