TransformerBig model - GTX 1080 Ti (11G) - ResourceExhaustedError



I tried to use the TransformerBig model for French -> English translation.
The vocabulary size for each language is 30000 tokens.

I use opennmt-tf.

I have one GPU - GTX 1080 Ti with 11GB Memory.

The error message shows some information:

Allocator (GPU_0_bfc) ran out of memory trying to allocate 322.28MiB.

2018-12-03 18:03:59.608122: I tensorflow/core/common_runtime/] Stats:
Limit: 10554428621
InUse: 10287149568
MaxInUse: 10409232128
NumAllocs: 2559943
MaxAllocSize: 430094336
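
As a rough reading of these numbers, the short sketch below converts the allocator stats to MiB/GiB and compares the remaining headroom with the failed request. It only uses the values quoted in the log above. The variable names are illustrative, and it assumes the simple view that headroom = Limit - InUse (it ignores fragmentation inside the allocator).

```python
# Values copied from the allocator stats in the error message above.
MIB = 1024 ** 2
GIB = 1024 ** 3

limit = 10554428621      # "Limit": bytes the process may use on the GPU
in_use = 10287149568     # "InUse": bytes allocated when the error occurred
requested_mib = 322.28   # size of the allocation that failed

# Simplifying assumption: free memory is just limit minus in-use bytes.
free_mib = (limit - in_use) / MIB

print(f"Limit:     {limit / GIB:.2f} GiB")
print(f"InUse:     {in_use / GIB:.2f} GiB")
print(f"Free:      {free_mib:.1f} MiB")
print(f"Requested: {requested_mib:.2f} MiB")
print("Request fits:", requested_mib <= free_mib)
```

Under that assumption, about 255 MiB were left when TensorFlow asked for 322.28 MiB, so the GPU was already essentially full before this allocation.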

However, even with batch_size = 32, I get a ResourceExhaustedError. The discussion in this post

suggests using batch_size = 8192 or 4096 (but that is for 8 GPUs). I assumed that with 1 GPU, I could at least use batch_size = 1024 or 512.

With 1 GPU with 11GB of memory, how large can the batch_size be with the TransformerBig model? Should I change other parameters to make it work?


It seems it’s a tensorflow issue, see

The solution in tensorflow is:

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)

However, can we specify this behavior in the YAML configuration file?

(Guillaume Klein) #3

That will not fix the issue. On the contrary, it will further reduce the memory available to the process.

Are you using automatic configuration (--auto_config)? If yes, try turning gradient accumulation off:

  gradients_accum: 1

With that, it looks like I can run with batch size 1024 on a 6GB GPU.


It’s still not working. Could you share the parameters in your YAML file and the complete parameters shown when training is launched, please? Mine are below:

My configuration:

# The directory where models and summaries will be saved. It is created if it does not exist.
model_dir: Model/Transformer-Big/2002756/

data:
  # (required for train_and_eval and train run types).
  train_features_file: Data/Training/
  train_labels_file:   Data/Training/
  # (required for train_and_eval and eval run types).
  eval_features_file: Data/Evaluation/newstest-2013.fr_raw.txt
  eval_labels_file:   Data/Evaluation/newstest-2013.en_raw.txt
  # (optional) Models may require additional resource files (e.g. vocabularies).
  source_words_vocabulary: fr-vocab-30000-tokenized.txt
  target_words_vocabulary: en-vocab-30000-tokenized.txt
  source_tokenizer_config: tokenization.yml
  target_tokenizer_config: tokenization.yml
# Training options.
train:
  batch_size: 1024
  # (optional) Batch size is the number of "examples" or "tokens" (default: "examples").
  batch_type: examples
  # (optional) Save a checkpoint every this many steps.
  save_checkpoints_steps: 1000
  # (optional) How many checkpoints to keep on disk.
  keep_checkpoint_max: 10
  # (optional) Save summaries every this many steps.
  save_summary_steps: 1000
  # (optional) Train for this many steps. If not set, train forever.
  train_steps: 1280000
  # (optional) The number of threads to use for processing data in parallel (default: 4).
  num_threads: 8
  # (optional) The number of elements from which to sample during shuffling (default: 500000).
  # Set 0 or null to disable shuffling, -1 to match the number of training examples.
  sample_buffer_size: 500000
  # (optional) Number of checkpoints to average at the end of the training to the directory
  # model_dir/avg (default: 0).
  average_last_checkpoints: 10
# (optional) Evaluation options.
eval:
  # (optional) The batch size to use (default: 32).
  batch_size: 1024
  # (optional) The number of threads to use for processing data in parallel (default: 1).
  num_threads: 8
  # (optional) Evaluate every this many seconds (default: 18000).
  eval_delay: 0
  # (optional) Save evaluation predictions in model_dir/eval/.
  save_eval_predictions: True
  # (optional) Evaluator or list of evaluators that are called on the saved evaluation predictions.
  # Available evaluators: BLEU, BLEU-detok, ROUGE
  external_evaluators: [BLEU, BLEU-detok]
  # (optional) Model exporter(s) to use during the training and evaluation loop:
  # last, final, best, or null (default: last).
  exporters: last

The information shown when training is launched:

INFO:tensorflow:Using parameters: {
  "data": {
    "eval_features_file": "Data/Evaluation/newstest-2013.fr_raw.txt",
    "eval_labels_file": "Data/Evaluation/newstest-2013.en_raw.txt",
    "source_tokenizer_config": "tokenization.yml",
    "source_words_vocabulary": "fr-vocab-30000-tokenized.txt",
    "target_tokenizer_config": "tokenization.yml",
    "target_words_vocabulary": "en-vocab-30000-tokenized.txt",
    "train_features_file": "Data/Training/",
    "train_labels_file": "Data/Training/"
  },
  "eval": {
    "batch_size": 1024,
    "eval_delay": 0,
    "exporters": "last",
    "external_evaluators": [
      "BLEU",
      "BLEU-detok"
    ],
    "num_threads": 8,
    "save_eval_predictions": true
  },
  "infer": {
    "batch_size": 32,
    "bucket_width": 5
  },
  "model_dir": "Model/Transformer-Big/2002756/",
  "params": {
    "average_loss_in_time": true,
    "beam_width": 4,
    "decay_params": {
      "model_dim": 1024,
      "warmup_steps": 8000
    },
    "decay_type": "noam_decay_v2",
    "gradients_accum": 1,
    "label_smoothing": 0.1,
    "learning_rate": 2.0,
    "length_penalty": 0.6,
    "optimizer": "LazyAdamOptimizer",
    "optimizer_params": {
      "beta1": 0.9,
      "beta2": 0.998
    }
  },
  "score": {
    "batch_size": 64
  },
  "train": {
    "average_last_checkpoints": 10,
    "batch_size": 1024,
    "batch_type": "examples",
    "bucket_width": 1,
    "keep_checkpoint_max": 10,
    "maximum_features_length": 100,
    "maximum_labels_length": 100,
    "num_threads": 8,
    "sample_buffer_size": 500000,
    "save_checkpoints_steps": 1000,
    "save_summary_steps": 1000,
    "train_steps": 1280000
  }
}

(Guillaume Klein) #5

This should be batch_type: tokens.
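
To see why the batch type matters so much here, the sketch below compares a rough upper bound on source tokens per batch for the two settings, using batch_size = 1024 and maximum_features_length = 100 from the log above. The helper max_tokens_per_batch is hypothetical (it is not part of opennmt-tf), and it assumes that with batch_type: tokens the batch holds roughly batch_size tokens, while with batch_type: examples it holds batch_size sentences of up to the maximum length each.

```python
# Values taken from the "train" section of the training log above.
batch_size = 1024
maximum_features_length = 100


def max_tokens_per_batch(batch_size, batch_type, max_length):
    """Rough upper bound on source tokens in one batch (hypothetical helper)."""
    if batch_type == "examples":
        # batch_size sentences, each up to max_length tokens long.
        return batch_size * max_length
    if batch_type == "tokens":
        # batch_size is already counted in tokens (approximation).
        return batch_size
    raise ValueError(f"unknown batch_type: {batch_type}")


examples_bound = max_tokens_per_batch(batch_size, "examples", maximum_features_length)
tokens_bound = max_tokens_per_batch(batch_size, "tokens", maximum_features_length)

print("examples:", examples_bound, "tokens per batch at most")
print("tokens:  ", tokens_bound, "tokens per batch at most")
print("ratio:   ", examples_bound // tokens_bound)
```

Under these assumptions, the same batch_size value can mean up to 100 times more tokens per batch with batch_type: examples, which would explain the out-of-memory error.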


Thank you. It works now, I can even use batch_size = 3072.
However, it only works with mode = train, not with mode = train_and_eval.
So I cannot monitor the BLEU score during training. Do you know a solution? Thanks.

(Guillaume Klein) #7

Do you mean you get an out-of-memory error during evaluation? Did you try reducing the evaluation batch size?


Currently, I use the same batch_size as for training. That seems to be the wrong choice, since I don’t need a big batch_size during evaluation. I will try it and report back.


Reducing batch_size in eval works. But from time to time, I get “DataLossError …: Checksum does not match”. I think it’s a problem with my file system or my TensorFlow installation…