I was carrying out a couple of experiments and my GPU was running out of memory, so I had to lower my batch size, which worked fine. However, I observed that changing the batch size changes the BLEU score. Which hyperparameters do I need to adjust when I change the batch size (or any other parameter) so as to keep the best BLEU score possible?
You might want to use gradient accumulation (accum_count / accum_steps) to simulate bigger batches.
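For example, a sketch of the relevant config lines (the numbers here are illustrative, not from the thread):

```yaml
# Simulate a 4096-token batch on a GPU that only fits 1024 tokens at a time:
batch_type: "tokens"
batch_size: 1024
accum_count: [4]   # accumulate gradients over 4 batches before each update
accum_steps: [0]   # apply this accum_count from training step 0 onward
```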
Many thanks for the response. Let me try varying accum_count / accum_steps.
Suppose I want the setting below:
batch_size * world_size * accum_count, i.e. 4096 * 1 * 3 (the original setting), but that was running out of memory for me. So would 64 * 6 * 192 match up to the original setting?
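The arithmetic can be sanity-checked with a quick sketch (the helper name is illustrative; the formula is the effective-batch-size product discussed in this thread):

```python
# Effective tokens per optimizer update = batch_size * world_size * accum_count.
def true_batch_size(batch_size, world_size, accum_count):
    return batch_size * world_size * accum_count

original = true_batch_size(4096, 1, 3)   # 12288
proposed = true_batch_size(64, 6, 192)   # 73728
# Note: 73728 is 6x 12288, so the two settings do not actually match.
print(original, proposed, proposed == original)
```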
Which batch_type do you use? 4096 is supposed to be a token batch size, but your proposal of 64 implies an example batch size. Maybe you just did not set the batch type properly, and that's why you're getting OOM in the first place?
I have set my batch type to tokens, and my training data has 1.3 million sentences. Yes, 64 is the batch size I'm currently using, because with higher batch sizes I get a CUDA out-of-memory error.
Please post your full config / command line. A batch size of 64 tokens does not make much sense.
```yaml
# General opts
save_model: data/Geneu
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 2053
report_every: 500
train_steps: 100000
valid_steps: 5000

# Batching
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks:
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
max_generator_batches: 0
accum_count:
accum_steps:

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 16000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
early_stopping: 4

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 4
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps:
dropout: [0.1]
attention_dropout: [0.1]
#share_decoder_embeddings: true
```
This was my initial setup, which was throwing an "out of memory" exception. Hence I had to modify my batch_size and accum_count accordingly.
What GPU are you using?
When does the OOM happen?
You may need to add the filtertoolong transform to your transforms to ignore examples that may be too big to fit in memory.
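A sketch of the relevant lines (the length thresholds are illustrative; pick values suited to your data):

```yaml
# Drop training examples longer than 200 tokens on either side
transforms: [filtertoolong]
src_seq_length: 200
tgt_seq_length: 200
```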
I'm using an NVIDIA RTX 2080 Ti GPU, and I see that the OOM happens after 4200 steps. As you mentioned, I will try using the filtertoolong option. Let me know if there is any major mistake in the setup I have been using.
Just one note @prashanth - I believe world_size is irrelevant here. You do not change it unless you actually have that number of GPUs, which should be reflected in gpu_ranks as well. I assume it might be a typo, but just in case.
Based on the True_batch_size thread, world_size is relevant. Taking this into consideration, I had made those modifications.
Yasmin is right. world_size is the number of GPUs that will be used. In your case, since you seem to be using only one GPU, it must remain 1.
Also, since you're using an RTX 2080 Ti, you might want to set model_dtype to "fp16" to take advantage of mixed-precision training (less VRAM usage and a bit faster).
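Putting the two corrections together, the relevant config lines would look something like this (a sketch; gpu_ranks value assumed for a single-GPU machine):

```yaml
# Single-GPU setup with mixed precision
world_size: 1
gpu_ranks: [0]
model_dtype: "fp16"
```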
The setting above was my initial one, which was throwing the OOM exception, so I was trying other combinations based on the True_batch_size thread.
The 64 * 6 * 192 combination was run using 6 GPUs.