TransformerBig model - GTX 1080 Ti (11G) - ResourceExhaustedError

chiapas · December 3, 2018, 5:23pm

I tried to use the TransformerBig model for French -> English translation.
The vocabulary size for each language is 30000 tokens.

I use opennmt-tf.

I have one GPU, a GTX 1080 Ti with 11GB of memory.

The error message shows some information:

Allocator (GPU_0_bfc) ran out of memory trying to allocate 322.28MiB.

2018-12-03 18:03:59.608122: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit: 10554428621
InUse: 10287149568
MaxInUse: 10409232128
NumAllocs: 2559943
MaxAllocSize: 430094336
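
To make these numbers easier to read, here is a small sketch that converts the byte counts from the log above and compares the free space under the allocator limit with the failed request. The variable names are only illustrative, and the comparison ignores fragmentation, which can make things worse in practice.

```python
# Numbers copied from the allocator log above (bytes).
limit = 10554428621
in_use = 10287149568
requested_mib = 322.28  # size of the failed allocation, in MiB

MIB = 1024 ** 2

# Space still available under the allocator limit.
free_mib = (limit - in_use) / MIB
in_use_fraction = in_use / limit

print(f"limit:  {limit / MIB:.0f} MiB")
print(f"in use: {in_use / MIB:.0f} MiB ({in_use_fraction:.1%} of limit)")
print(f"free:   {free_mib:.0f} MiB, requested: {requested_mib} MiB")
print("request fits:", free_mib >= requested_mib)
```

With these values the free space is smaller than the requested block, so the allocation cannot succeed.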

However, even with batch_size = 32, I get a ResourceExhaustedError. The discussion on this post shows that it’s better to use batch_size = 8192 or 4096 (but that is for 8 GPUs). I assumed that with 1 GPU I could at least use batch_size = 1024 or 512.

With 1 GPU with 11GB of memory, how large can batch_size be with the TransformerBig model? Should I change other parameters to make it work?

chiapas · December 3, 2018, 5:53pm

It seems it’s a tensorflow issue, see

https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory

The solution in tensorflow is:

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)

However, can we specify this behavior in the YAML configuration file?

guillaumekln · December 4, 2018, 8:33am

That will not fix the issue. On the contrary, it will further reduce the memory available to the process.
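
As a rough illustration, assuming the card exposes about 11 GiB in total (an approximation, the exact figure depends on the driver and on what else is running), the fraction from the Stack Overflow answer would cap the process at roughly a third of that:

```python
# Assumed total GPU memory for a GTX 1080 Ti (approximate, for illustration).
total_mib = 11 * 1024

# Value of per_process_gpu_memory_fraction from the snippet quoted above.
fraction = 0.333

# Memory the process would be allowed to use under that setting.
cap_mib = total_mib * fraction
print(f"cap: {cap_mib:.0f} MiB of {total_mib} MiB")
```

The allocator limit in the original log was already about 9.8 GiB, so this option would leave the model less room, not more.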

Are you using automatic configuration (--auto_config)? If yes, try turning gradient accumulation off:

params:
  gradients_accum: 1

With that, it looks like I can run with batch size 1024 on a 6GB GPU.

chiapas · December 4, 2018, 10:26am

It’s still not working. Could you share the parameters in your .yml file and the complete parameters shown when training is launched, please? Mine are as follows:

My configuration:

# The directory where models and summaries will be saved. It is created if it does not exist.
model_dir: Model/Transformer-Big/2002756/
 
data:
  # (required for train_and_eval and train run types).
  train_features_file: Data/Training/europarl-v7.fr-en.fr_raw_2002756
  train_labels_file:   Data/Training/europarl-v7.fr-en.en_raw_2002756
 
  # (required for train_and_eval and eval run types).
  eval_features_file: Data/Evaluation/newstest-2013.fr_raw.txt
  eval_labels_file:   Data/Evaluation/newstest-2013.en_raw.txt
 
  # (optional) Models may require additional resource files (e.g. vocabularies).
  source_words_vocabulary: fr-vocab-30000-tokenized.txt
  target_words_vocabulary: en-vocab-30000-tokenized.txt
 
  source_tokenizer_config: tokenization.yml
  target_tokenizer_config: tokenization.yml
 
# Training options.
train:
  batch_size: 1024
  
  # (optional) Batch size is the number of "examples" or "tokens" (default: "examples").
  batch_type: examples
 
  # (optional) Save a checkpoint every this many steps.
  save_checkpoints_steps: 1000
  
  # (optional) How many checkpoints to keep on disk.
  keep_checkpoint_max: 10
 
  # (optional) Save summaries every this many steps.
  save_summary_steps: 1000
 
  # (optional) Train for this many steps. If not set, train forever.
  train_steps: 1280000
 
  # (optional) The number of threads to use for processing data in parallel (default: 4).
  num_threads: 8
 
  # (optional) The number of elements from which to sample during shuffling (default: 500000).
  # Set 0 or null to disable shuffling, -1 to match the number of training examples.
  sample_buffer_size: 500000
 
  # (optional) Number of checkpoints to average at the end of the training to the directory
  # model_dir/avg (default: 0).
  average_last_checkpoints: 10
 
 
# (optional) Evaluation options.
eval:
  # (optional) The batch size to use (default: 32).
  batch_size: 1024
 
  # (optional) The number of threads to use for processing data in parallel (default: 1).
  num_threads: 8
 
  # (optional) Evaluate every this many seconds (default: 18000).
  eval_delay: 0
 
  # (optional) Save evaluation predictions in model_dir/eval/.
  save_eval_predictions: True
  
  # (optional) Evaluator or list of evaluators that are called on the saved evaluation predictions.
  # Available evaluators: BLEU, BLEU-detok, ROUGE
  external_evaluators: [BLEU, BLEU-detok]
 
  # (optional) Model exporter(s) to use during the training and evaluation loop:
  # last, final, best, or null (default: last).
  exporters: last

The information shown when training is launched:

INFO:tensorflow:Using parameters: {
  "data": {
    "eval_features_file": "Data/Evaluation/newstest-2013.fr_raw.txt",
    "eval_labels_file": "Data/Evaluation/newstest-2013.en_raw.txt",
    "source_tokenizer_config": "tokenization.yml",
    "source_words_vocabulary": "fr-vocab-30000-tokenized.txt",
    "target_tokenizer_config": "tokenization.yml",
    "target_words_vocabulary": "en-vocab-30000-tokenized.txt",
    "train_features_file": "Data/Training/europarl-v7.fr-en.fr_raw_2002756",
    "train_labels_file": "Data/Training/europarl-v7.fr-en.en_raw_2002756"
  },
  "eval": {
    "batch_size": 1024,
    "eval_delay": 0,
    "exporters": "last",
    "external_evaluators": [
      "BLEU",
      "BLEU-detok"
    ],
    "num_threads": 8,
    "save_eval_predictions": true
  },
  "infer": {
    "batch_size": 32,
    "bucket_width": 5
  },
  "model_dir": "Model/Transformer-Big/2002756/",
  "params": {
    "average_loss_in_time": true,
    "beam_width": 4,
    "decay_params": {
      "model_dim": 1024,
      "warmup_steps": 8000
    },
    "decay_type": "noam_decay_v2",
    "gradients_accum": 1,
    "label_smoothing": 0.1,
    "learning_rate": 2.0,
    "length_penalty": 0.6,
    "optimizer": "LazyAdamOptimizer",
    "optimizer_params": {
      "beta1": 0.9,
      "beta2": 0.998
    }
  },
  "score": {
    "batch_size": 64
  },
  "train": {
    "average_last_checkpoints": 10,
    "batch_size": 1024,
    "batch_type": "examples",
    "bucket_width": 1,
    "keep_checkpoint_max": 10,
    "maximum_features_length": 100,
    "maximum_labels_length": 100,
    "num_threads": 8,
    "sample_buffer_size": 500000,
    "save_checkpoints_steps": 1000,
    "save_summary_steps": 1000,
    "train_steps": 1280000
  }
}

guillaumekln · December 4, 2018, 10:29am

This should be batch_type: tokens.
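
To illustrate the difference, here is a rough sketch of the worst-case number of source tokens in one batch under both settings, using batch_size 1024 and maximum_features_length 100 from the log above. The helper name is hypothetical and not part of OpenNMT-tf, and the real token count per batch is only approximately bounded because of padding and bucketing.

```python
def worst_case_tokens(batch_type, batch_size, max_length):
    """Upper bound on tokens per batch (hypothetical helper, illustration only)."""
    if batch_type == "examples":
        # batch_size counts sentences; each may be up to max_length tokens long.
        return batch_size * max_length
    if batch_type == "tokens":
        # batch_size already counts tokens.
        return batch_size
    raise ValueError(f"unknown batch_type: {batch_type}")


by_examples = worst_case_tokens("examples", 1024, 100)
by_tokens = worst_case_tokens("tokens", 1024, 100)
print(by_examples, by_tokens, by_examples // by_tokens)
```

With batch_type: examples, a batch of 1024 can therefore be up to 100 times larger than the same batch_size counted in tokens.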

chiapas · December 4, 2018, 5:57pm

Thank you. It works now, and I can even use batch_size = 3072.
However, it only works with mode = train, not with mode = train_and_eval,
so I cannot monitor the BLEU score during training. Do you know a solution? Thanks.

guillaumekln · December 4, 2018, 7:25pm

Do you mean you get an out-of-memory error during evaluation? Did you try reducing the evaluation batch size?

chiapas · December 4, 2018, 7:35pm

Currently, I use the same batch_size as for training. That seems to be the wrong choice, since I don’t need a big batch_size during evaluation. I will try it and report back.

chiapas · December 5, 2018, 4:45pm

Reducing batch_size in eval works. But from time to time, I get “DataLossError …: Checksum does not match”. I think it’s a problem with my file system or my TensorFlow installation…
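
For anyone hitting the same problem, here is a minimal sketch of the change, with the relevant settings from this thread held in a plain dict. The value 32 is an assumption: it is only the default mentioned in the configuration comments, not necessarily the value that was used here.

```python
# Relevant settings from this thread, held in a plain dict for illustration.
config = {
    "train": {"batch_size": 3072, "batch_type": "tokens"},
    "eval": {"batch_size": 1024},  # same value as first used for training
}

# Assumption: 32 is the documented default from the config comments,
# not necessarily the value that was actually used.
config["eval"]["batch_size"] = 32

# Print the result in the same layout as the YAML configuration file.
lines = []
for section, options in config.items():
    lines.append(f"{section}:")
    for key, value in options.items():
        lines.append(f"  {key}: {value}")
yaml_text = "\n".join(lines)
print(yaml_text)
```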