Optimal Hyperparameter setting

I was running a couple of experiments and my GPU ran out of memory, so I had to lower my batch size, after which training worked fine. I observed that changing the batch size changes the BLEU score. Which hyperparameters do I need to adjust when I change the batch size, or any other parameter, to stay as close as possible to the best BLEU score?

You might want to use gradient accumulation (accum_count // accum_steps) to simulate bigger batches.

Many thanks for the response. Let me try varying accum_count // accum_steps.

Suppose I want the below setting:

batch_size * world_size * accum_count, i.e. 4096 * 1 * 3 (the original setting), but that was running out of memory for me. So would
64 * 6 * 192 match the original setting?
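The arithmetic in this question can be checked with a small sketch. It assumes that the effective batch is simply the product batch_size * world_size * accum_count and that both settings count the batch in the same unit (tokens). The helper name `effective_batch` is hypothetical and not part of OpenNMT.

```python
def effective_batch(batch_size: int, world_size: int, accum_count: int) -> int:
    """Effective batch per optimizer step, assuming a plain product of the three settings."""
    return batch_size * world_size * accum_count


# Original setting from the question: 4096 tokens, 1 GPU, accumulate 3 batches.
original = effective_batch(4096, 1, 3)

# Proposed setting from the question: 64 tokens, 6 GPUs, accumulate 192 batches.
proposed = effective_batch(64, 6, 192)

# accum_count that would reproduce the original product with batch_size=64 on 6 GPUs.
matching_accum = original // (64 * 6)

print(original)        # 12288
print(proposed)        # 73728
print(matching_accum)  # 32
```

Under these assumptions the two settings do not match: the proposed product is six times larger than the original, and an accum_count of 32 would be needed to reproduce 12288 with batch_size 64 on 6 GPUs.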

Do you use batch_type tokens or sents?
Because 4096 is supposed to be a token batch size, but your proposal of 64 implies a batch size counted in examples?

Maybe you just did not set the batch type properly and that’s why you’re getting some OOM in the first place?

Hi @francoishernandez,
I have set my batch type to tokens, and my training data has 1.3 million sentences. Yes, 64 is the batch size I’m currently using, because with higher batch sizes I get a CUDA memory error.


Please post your full config / command line. Batch size 64 in tokens does not make much sense.

# General opts
save_model: data/Geneu
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 2053
report_every: 500
train_steps: 100000
valid_steps: 5000

# Batching
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 16000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
early_stopping: 4

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 4
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
#share_decoder_embeddings: true

This was my initial setup, which threw an “out of memory” exception. Hence I had to modify my batch size and accum_count accordingly.

What GPU are you using?
When does the OOM happen?
You may need to add the filtertoolong transform to your transforms to ignore examples that may be too big to fit in memory.
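As a rough illustration of what such a length filter does, here is a conceptual sketch. It is not OpenNMT's implementation: the function name, the whitespace tokenization, and the threshold parameters `src_max_len` and `tgt_max_len` are all assumptions made for this example.

```python
def filter_too_long(pairs, src_max_len=150, tgt_max_len=150):
    """Keep only (source, target) pairs whose token counts are within the limits.

    Tokens are approximated by whitespace splitting, which is an assumption
    for illustration only; a real pipeline would count subword tokens.
    """
    kept = []
    for src, tgt in pairs:
        if len(src.split()) <= src_max_len and len(tgt.split()) <= tgt_max_len:
            kept.append((src, tgt))
    return kept


examples = [
    ("a short source sentence", "a short target sentence"),
    ("word " * 500, "one very long source with a short target"),
]
filtered = filter_too_long(examples, src_max_len=150, tgt_max_len=150)
print(len(filtered))  # 1
```

Dropping a few very long examples like this can prevent a single oversized batch from exhausting GPU memory partway through training.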

Hi,
I’m using an NVIDIA RTX 2080 Ti GPU, and I see that the OOM happens after 4200 steps. As you mentioned, I will try the filtertoolong option. Let me know if there is any major mistake in the setup I have been using.

Just one note @prashanth - I believe world_size is irrelevant here. You do not change it unless you actually have that number of GPUs, which should also be reflected in gpu_ranks. I assume it might be a typo, but just in case.

Hi Yasmin,
Based on this thread, world_size is relevant to the true batch size. With that in mind, I made those modifications.


Yasmin is right. world_size is the number of GPUs that will be used. In your case, since you seem to be using only one GPU, then it must remain 1.

Also, since you’re using an RTX 2080Ti, you might want to set model_dtype to “fp16” to take advantage of mixed precision training (less VRAM usage and a bit faster).

The above setting was the initial one, which was throwing the OOM exception. So I was trying other combinations based on the true batch size. The 64 * 6 * 192 setting was run using 6 GPUs.