Hi,
I was carrying out a couple of experiments and my GPU was running out of memory. As a result, I had to lower my batch size and it worked fine. I observed that changing the batch size changes the BLEU score. Which hyperparameters do I need to adjust when I change the batch size (or any other parameter) so that I still get the best BLEU score possible?
You might want to use gradient accumulation (accum_count / accum_steps) to simulate bigger batches.
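As a rough sketch (the numbers are only illustrative), lowering batch_size while raising accum_count keeps the effective batch size the same:
batch_type: "tokens"
batch_size: 2048   # half of a 4096-token batch that does not fit in memory
accum_count: [2]   # accumulate gradients over 2 batches before each optimizer update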
Many thanks for the response. Let me try varying the accum_count / accum_steps.
Hi,
Suppose I want the following setting:
batch_size * world_size * accum_count, i.e. 4096 * 1 * 3 (the original setting). But since that was running out of memory for me, would
64 * 6 * 192 match up to the original setting?
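If the effective batch per update is batch_size * world_size * accum_count, the original setting corresponds to 4096 * 1 * 3 = 12,288 tokens per optimizer update.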
Do you use batch_type tokens or sents? Because 4096 is supposed to be a token batch size, but your proposal of 64 implies an example batch size?
Maybe you just did not set the batch type properly and that’s why you’re getting some OOM in the first place?
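For reference, a rough sketch of the two modes:
batch_type: "tokens"    # batch_size counts tokens, e.g. 4096
# batch_type: "sents"   # batch_size counts sentences/examples, e.g. 64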
Hi @francoishernandez,
I have set my batch type to tokens, and my training data has 1.3 million sentences. Yes, 64 is the current batch size I'm using, because when I use higher batch sizes I get a CUDA memory error.
Prashanth
Please post your full config / command line. Batch size 64 in tokens does not make much sense.
# General opts
save_model: data/Geneu
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 2053
report_every: 500
train_steps: 100000
valid_steps: 5000
# Batching
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]
# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 16000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
early_stopping: 4
# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 4
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
#share_decoder_embeddings: true
This was my initial setup, which was throwing the "out of memory" exception. Hence I had to modify my batch size and accum_count accordingly.
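For instance, a combination along these lines (just a sketch, not tested) would keep the same effective batch on a single GPU:
batch_type: "tokens"
batch_size: 2048
accum_count: [6]    # 2048 * 1 * 6 = 12,288 tokens per update, same as 4096 * 3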
What GPU are you using?
When does the OOM happen?
You may need to add the filtertoolong transform to your transforms to ignore examples that may be too big to fit in memory.
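For example (a sketch; double-check the exact option names against the docs):
transforms: [filtertoolong]
src_seq_length: 200    # drop examples whose source side exceeds 200 tokens
tgt_seq_length: 200    # drop examples whose target side exceeds 200 tokens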
Hi,
I'm using an NVIDIA RTX 2080 Ti GPU, and I see that the OOM happens after 4200 steps. As you mentioned, I will try using the filtertoolong option. Let me know if there is any major mistake in the setup I have been using.
Just one note @prashanth - I believe world_size is irrelevant here. You do not change it unless you actually have that number of GPUs, which should be reflected in gpu_ranks as well. I assume it might be a typo, but just in case.
Hi Yasmin,
Based on this thread (True batch size), world_size is relevant. Taking this into consideration, I had made those modifications.
Prashanth
Yasmin is right. world_size is the number of GPUs that will be used. In your case, since you seem to be using only one GPU, it must remain 1.
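In other words, something like:
# single GPU
world_size: 1
gpu_ranks: [0]
# a 6-GPU machine, for comparison, would use
# world_size: 6
# gpu_ranks: [0, 1, 2, 3, 4, 5]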
Also, since you’re using an RTX 2080Ti, you might want to set model_dtype to “fp16” to take advantage of mixed precision training (less VRAM usage and a bit faster).
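That is a one-line change in the config:
model_dtype: "fp16"    # mixed precision; the rest of the config can stay as is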
The above setting was the initial one, which was throwing the OOM exception. So I was trying other combinations based on the true batch size; 64 * 6 * 192 was run using 6 GPUs.
Prashanth