TransformerBig model - GTX 1080 Ti (11G) - ResourceExhaustedError

I tried to use the TransformerBig model for French -> English translation.
The vocabulary size is 30,000 tokens for each language.

I use OpenNMT-tf.

I have one GPU, a GTX 1080 Ti with 11 GB of memory.

The error message shows the following information:

Allocator (GPU_0_bfc) ran out of memory trying to allocate 322.28MiB.

2018-12-03 18:03:59.608122: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit: 10554428621
InUse: 10287149568
MaxInUse: 10409232128
NumAllocs: 2559943
MaxAllocSize: 430094336

However, even with batch_size = 32, I got a ResourceExhaustedError. The discussion in this post shows that it is better to use batch_size = 8192 or 4096 (but that is for 8 GPUs). I assume that with 1 GPU, I can at least use batch_size = 1024 or 512.

With 1 GPU with 11 GB of memory, how large can the batch_size be with the TransformerBig model? Should I change other parameters to make it work?

It seems to be a TensorFlow issue, see:

https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory

The solution in TensorFlow is:

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

However, can we specify this behavior in the YAML configuration file?

That will not fix the issue; on the contrary, it will further reduce the memory available to the process (a fraction of 0.333 would cap TensorFlow at about a third of the 11 GB).

Are you using automatic configuration (--auto_config)? If yes, try turning gradient accumulation off:

params:
  gradients_accum: 1

With that, it looks like I can run with a batch size of 1024 on a 6 GB GPU.

It’s still not working. Could you please share your parameters in the .yml file and the complete parameters printed when training is launched? The following are mine:


My configuration:

# The directory where models and summaries will be saved. It is created if it does not exist.
model_dir: Model/Transformer-Big/2002756/
 
data:
  # (required for train_and_eval and train run types).
  train_features_file: Data/Training/europarl-v7.fr-en.fr_raw_2002756
  train_labels_file:   Data/Training/europarl-v7.fr-en.en_raw_2002756
 
  # (required for train_and_eval and eval run types).
  eval_features_file: Data/Evaluation/newstest-2013.fr_raw.txt
  eval_labels_file:   Data/Evaluation/newstest-2013.en_raw.txt
 
  # (optional) Models may require additional resource files (e.g. vocabularies).
  source_words_vocabulary: fr-vocab-30000-tokenized.txt
  target_words_vocabulary: en-vocab-30000-tokenized.txt
 
  source_tokenizer_config: tokenization.yml
  target_tokenizer_config: tokenization.yml
 
# Training options.
train:
  batch_size: 1024
  
  # (optional) Batch size is the number of "examples" or "tokens" (default: "examples").
  batch_type: examples
 
  # (optional) Save a checkpoint every this many steps.
  save_checkpoints_steps: 1000
  
  # (optional) How many checkpoints to keep on disk.
  keep_checkpoint_max: 10
 
  # (optional) Save summaries every this many steps.
  save_summary_steps: 1000
 
  # (optional) Train for this many steps. If not set, train forever.
  train_steps: 1280000
 
  # (optional) The number of threads to use for processing data in parallel (default: 4).
  num_threads: 8
 
  # (optional) The number of elements from which to sample during shuffling (default: 500000).
  # Set 0 or null to disable shuffling, -1 to match the number of training examples.
  sample_buffer_size: 500000
 
  # (optional) Number of checkpoints to average at the end of the training to the directory
  # model_dir/avg (default: 0).
  average_last_checkpoints: 10
 
 
# (optional) Evaluation options.
eval:
  # (optional) The batch size to use (default: 32).
  batch_size: 1024
 
  # (optional) The number of threads to use for processing data in parallel (default: 1).
  num_threads: 8
 
  # (optional) Evaluate every this many seconds (default: 18000).
  eval_delay: 0
 
  # (optional) Save evaluation predictions in model_dir/eval/.
  save_eval_predictions: True
  
  # (optional) Evaluator or list of evaluators that are called on the saved evaluation predictions.
  # Available evaluators: BLEU, BLEU-detok, ROUGE
  external_evaluators: [BLEU, BLEU-detok]
 
  # (optional) Model exporter(s) to use during the training and evaluation loop:
  # last, final, best, or null (default: last).
  exporters: last

The information shown when training is launched:

INFO:tensorflow:Using parameters: {
  "data": {
    "eval_features_file": "Data/Evaluation/newstest-2013.fr_raw.txt",
    "eval_labels_file": "Data/Evaluation/newstest-2013.en_raw.txt",
    "source_tokenizer_config": "tokenization.yml",
    "source_words_vocabulary": "fr-vocab-30000-tokenized.txt",
    "target_tokenizer_config": "tokenization.yml",
    "target_words_vocabulary": "en-vocab-30000-tokenized.txt",
    "train_features_file": "Data/Training/europarl-v7.fr-en.fr_raw_2002756",
    "train_labels_file": "Data/Training/europarl-v7.fr-en.en_raw_2002756"
  },
  "eval": {
    "batch_size": 1024,
    "eval_delay": 0,
    "exporters": "last",
    "external_evaluators": [
      "BLEU",
      "BLEU-detok"
    ],
    "num_threads": 8,
    "save_eval_predictions": true
  },
  "infer": {
    "batch_size": 32,
    "bucket_width": 5
  },
  "model_dir": "Model/Transformer-Big/2002756/",
  "params": {
    "average_loss_in_time": true,
    "beam_width": 4,
    "decay_params": {
      "model_dim": 1024,
      "warmup_steps": 8000
    },
    "decay_type": "noam_decay_v2",
    "gradients_accum": 1,
    "label_smoothing": 0.1,
    "learning_rate": 2.0,
    "length_penalty": 0.6,
    "optimizer": "LazyAdamOptimizer",
    "optimizer_params": {
      "beta1": 0.9,
      "beta2": 0.998
    }
  },
  "score": {
    "batch_size": 64
  },
  "train": {
    "average_last_checkpoints": 10,
    "batch_size": 1024,
    "batch_type": "examples",
    "bucket_width": 1,
    "keep_checkpoint_max": 10,
    "maximum_features_length": 100,
    "maximum_labels_length": 100,
    "num_threads": 8,
    "sample_buffer_size": 500000,
    "save_checkpoints_steps": 1000,
    "save_summary_steps": 1000,
    "train_steps": 1280000
  }
}

This should be batch_type: tokens.
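
For example, something like this in the train section (3072 tokens is only an illustrative value; tune it to what fits in the 11 GB):

train:
  # Count the batch in tokens rather than sentences.
  batch_type: tokens
  # Illustrative value; lower it if the GPU still runs out of memory.
  batch_size: 3072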


Thank you. It works now; I can even use batch_size = 3072.
However, it only works with mode = train, but not with mode = train_and_eval,
so I cannot monitor the BLEU score during training. Do you know a solution? Thanks.

Do you mean you get an out-of-memory error during evaluation? Did you try reducing the evaluation batch size?
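
For example, something like this in the eval section (32 is the documented default and should be more than enough):

eval:
  # A small batch is fine for evaluation; it does not need to match the training value.
  batch_size: 32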

Currently, I use the same batch_size as for training. That seems to be a wrong choice, since I don’t need a big batch_size during evaluation. I will try it and report back.

Reducing the batch_size in eval works. But from time to time, I get “DataLossError …: Checksum does not match”. I think it’s a problem with my file system or my TensorFlow installation…