IndexError when training

I am experimenting with the Transformer model, and training errors out before 10k steps:

Step 7800/200000; acc: 25.59; ppl: 48.25; xent: 3.88; lr: 0.00096; 212/101 tok/s; 7194 sec

Traceback (most recent call last):
  File "/home/u/OpenNMT-py/train.py", line 109, in <module>
    main(opt)
  File "/home/u/OpenNMT-py/train.py", line 39, in main
    single_main(opt, 0)
  File "/home/u/OpenNMT-py/onmt/train_single.py", line 116, in main
    valid_steps=opt.valid_steps)
  File "/home/u/OpenNMT-py/onmt/trainer.py", line 192, in train
    self._accum_batches(train_iter)):
  File "/home/u/OpenNMT-py/onmt/trainer.py", line 127, in _accum_batches
    for batch in iterator:
  File "/home/u/OpenNMT-py/onmt/inputters/inputter.py", line 588, in __iter__
    for batch in self._iter_dataset(path):
  File "/home/u/OpenNMT-py/onmt/inputters/inputter.py", line 573, in _iter_dataset
    for batch in cur_iter:
  File "/home/u/anaconda3/lib/python3.6/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/home/u/anaconda3/lib/python3.6/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/home/u/OpenNMT-py/onmt/inputters/text_dataset.py", line 121, in process
    base_data = self.base_field.process(batch_by_feat[0], device=device)
IndexError: list index out of range

The documentation lists many hyperparameters needed for the Transformer, so I’m not sure which of them might be causing this. I had to adjust some of them because my GPU setup couldn’t handle the documented values, and I am also using pretrained embeddings. Here are the training parameters I am using:

~/OpenNMT-py/train.py
-save_model data/model
-pre_word_vecs_enc ~/data/embeddings.enc.pt
-pre_word_vecs_dec ~/data/embeddings.dec.pt
-data ~/data/data
-save_model ~/data/model
-layers 6
-rnn_size 512
-word_vec_size 512
-transformer_ff 2048
-heads 8
-encoder_type transformer
-decoder_type transformer
-position_encoding
-train_steps 200000
-max_generator_batches 2
-dropout 0.1
-batch_size 128
-batch_type tokens
-normalization tokens
-accum_count 2
-optim adam
-adam_beta2 0.998
-decay_method noam
-warmup_steps 8000
-learning_rate 2
-max_grad_norm 0
-param_init 0
-param_init_glorot
-label_smoothing 0.1
-valid_steps 10000
-save_checkpoint_steps 10000
-world_size 1
-gpu_ranks 0
-train_from ~/data/model_step_200.pt

Can you try a larger batch size? Note that it is expressed in number of tokens, not examples.
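For reference, if I remember the Transformer example in the docs correctly, it pairs -batch_type tokens with a token budget in the thousands, along these lines (the numbers here are illustrative, scale them to whatever fits on your card):

-batch_type tokens
-batch_size 4096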

Thanks for the response. I can try a larger batch size. I reduced it because the GPU I am working with has 8 GB of memory: the model fit comfortably during training, but the process was erroring out during validation at 10,000 steps when it tried to allocate an additional 3 GB.

Is there a way to keep it from allocating so much memory during the validation step, or do I need to invest in a GPU with more than 8 GB of memory?
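(One thing I noticed while reading the options: there seems to be a separate -valid_batch_size flag, distinct from -batch_size. Would lowering it, for example

-valid_batch_size 8

be a reasonable way to cap memory during the validation pass? The value here is just a guess on my part.)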

Check for sentences that are too long in your validation data.
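A minimal sketch of one way to do that (the file path and the 100-token cutoff are placeholders, not values taken from your setup):

import sys

# Usage: python check_lengths.py path/to/valid.src 100
# Prints the line number and token count of every sentence longer than the cutoff.
path = sys.argv[1]
max_tokens = int(sys.argv[2]) if len(sys.argv) > 2 else 100

with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        n = len(line.split())
        if n > max_tokens:
            print(f"line {i}: {n} tokens")

Note that this counts whitespace-separated tokens, so if your data was preprocessed with subword units the counts will differ somewhat from what the training iterator actually sees.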