Bad argument #2 to '?': Error during training


(Tiago Cortinhal) #1

Hey!

So I was training a model:
[05/21/18 11:09:28 INFO] Using GPU(s): 1, 2
[05/21/18 11:09:28 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
[05/21/18 11:09:28 WARNING] For improved efficiency with multiple GPUs, consider installing nccl
[05/21/18 11:09:28 INFO] Training Sequence to Sequence with Attention model…
[05/21/18 11:09:28 INFO] Loading data from '…/news_UN/news_un_pre-train.t7'…
[05/21/18 11:16:33 INFO] * vocabulary size: source = 85223; target = 70472
[05/21/18 11:16:33 INFO] * additional features: source = 0; target = 0
[05/21/18 11:16:33 INFO] * maximum sequence length: source = 50; target = 51
[05/21/18 11:16:33 INFO] * number of training sentences: 10597851
[05/21/18 11:16:33 INFO] * number of batches: 146945
[05/21/18 11:16:33 INFO] - source sequence lengths: equal
[05/21/18 11:16:33 INFO] - maximum size: 250 sentences / 1800 tokens
[05/21/18 11:16:33 INFO] - average size: 72.12
[05/21/18 11:16:33 INFO] - capacity: 100.00%
[05/21/18 11:16:33 INFO] Loading checkpoint '…/news_UN/news_UN_checkpoint.t7'…
[05/21/18 11:16:45 INFO] Resuming training from epoch 1 at iteration 72001…
[05/21/18 11:16:52 INFO] Preparing memory optimization…
[05/21/18 11:16:53 INFO] * sharing 70% of output/gradInput tensors memory between clones
[05/21/18 11:16:53 INFO] Preallocating memory
[05/21/18 11:17:31 INFO] Restoring random number generator states…
[05/21/18 11:17:31 INFO] Start training from epoch 1 to 13…

and I got the following error:

: bad argument #2 to '?' (out of range at /home/tiagofilipe233/torch/pkg/torch/generic/Tensor.c:913)
stack traceback:
[C]: at 0x7fbf24c6c4b0
[C]: in function '__index'
./onmt/train/Trainer.lua:224: in function 'getBatchIdx'
./onmt/train/Trainer.lua:266: in function 'trainEpoch'
./onmt/train/Trainer.lua:474: in function 'train'
train.lua:333: in function 'main'
train.lua:338: in main chunk
[C]: in function 'dofile'
…e233/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x562379e84610

Can you guys help me sort this out?

Thank you :slight_smile: