Crash in WordEmbedding.lua

maplewizard · March 8, 2017, 9:16am

Hello, I am getting a crash in WordEmbedding.lua. The log is:

THCudaCheck FAIL file=/home/sai/tmp/cutorch/lib/THC/generic/THCTensorMath.cu line=35 error=8 : invalid device function
/home/sai/z49/opt/thg/bin/luajit: ./onmt/modules/WordEmbedding.lua:35: cuda runtime error (8) : invalid device function at /home/sai/tmp/cutorch/lib/THC/generic/THCTensorMath.cu:35
stack traceback:
        [C]: in function 'zero'
        ./onmt/modules/WordEmbedding.lua:35: in function 'postParametersInitialization'
        ./onmt/Model.lua:67: in function 'callback'
        /home/sai/z49/opt/thg/share/lua/5.1/nn/Module.lua:352: in function 'apply'
        /home/sai/z49/opt/thg/share/lua/5.1/nn/Module.lua:356: in function 'apply'
        /home/sai/z49/opt/thg/share/lua/5.1/nn/Module.lua:356: in function 'apply'
        ./onmt/Model.lua:65: in function 'initParams'
        ./onmt/train/Trainer.lua:67: in function 'closure'
        ./onmt/utils/Parallel.lua:79: in function 'launch'
        ./onmt/train/Trainer.lua:62: in function 'train'
        train.lua:129: in function 'main'
        train.lua:134: in main chunk
        [C]: at 0x00404a10

I printed the variable, and the crash seems to happen in the zero() function of a torch.CudaTensor. I tried the CPU version and it works fine. I also tried the following code, which works too:

require 'torch'
require 'cutorch'
a = torch.CudaTensor(500)
print (a)
a:zero()
print (a)

Could anyone help me find the reason?

guillaumekln · March 8, 2017, 10:15am

Hi,

Which version are you using and what are the options you used?

maplewizard · March 8, 2017, 11:10am

Hi, @guillaumekln ,
My LuaJIT version is:

th -v
LuaJIT 2.1.0-beta2 -- Copyright (C) 2005-2017 Mike Pall. http://luajit.org/

I don’t know the version of OpenNMT; I didn’t find an option that reports it.

guillaumekln · March 8, 2017, 11:41am

What training options are you using?

maplewizard · March 8, 2017, 11:46am

I run preprocessing and training as follows:

th preprocess.lua -train_src mydata/src-train.txt -train_tgt mydata/tgt-train.txt -valid_src mydata/src-val.txt -valid_tgt mydata/tgt-val.txt -save_data mydata/demo100w -seq_length 100 -src_vocab_size 300000 -tgt_vocab_size 120000

th train.lua -data mydata/demo100w-train.t7 -save_model mymodel -gpuid 1

guillaumekln · March 8, 2017, 12:28pm

Which GPU model are you using?

You are requesting large vocabulary sizes, so you need to make sure you have enough GPU memory. Could you try with smaller values? If that does not help, also consider reinstalling the latest Torch version.
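
A minimal sketch for gathering the information asked about above: a rough size estimate for the embedding tables, and the model and memory of the selected GPU. The 500-dimensional embedding size is an assumption for illustration, not a value taken from this thread, and device index 1 is assumed to match the -gpuid 1 option. The estimate covers only the embedding weights; gradients and the rest of the model need additional memory. The compute capability is printed because an "invalid device function" error often means the CUDA binaries were not built for the GPU's architecture, which may also be why reinstalling Torch helps.

```lua
-- Rough size of the embedding tables for the vocabulary sizes used above.
-- ASSUMPTION: 500-dimensional float embeddings; adjust to your configuration.
local BYTES_PER_FLOAT = 4
local EMBEDDING_DIM = 500  -- assumed value, not taken from the thread

-- Size in MiB of a vocabSize x dim table of 32-bit floats.
local function embeddingMiB(vocabSize, dim)
  return vocabSize * dim * BYTES_PER_FLOAT / (1024 * 1024)
end

print(string.format('source embeddings: %.1f MiB', embeddingMiB(300000, EMBEDDING_DIM)))
print(string.format('target embeddings: %.1f MiB', embeddingMiB(120000, EMBEDDING_DIM)))

-- Query the GPU if cutorch can be loaded (cutorch device indices start at 1).
local ok = pcall(require, 'cutorch')
if ok then
  local dev = 1  -- assumed to correspond to -gpuid 1
  local props = cutorch.getDeviceProperties(dev)
  local free, total = cutorch.getMemoryUsage(dev)
  print(string.format('GPU %d: %s, compute capability %d.%d',
                      dev, props.name, props.major, props.minor))
  print(string.format('memory: %.0f MiB free of %.0f MiB',
                      free / 2^20, total / 2^20))
else
  print('cutorch is not available; skipping the GPU query')
end
```

Running this with th and posting the output would answer both the GPU model and the memory questions in one go.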