Error in `<path>/torch/install/bin/lua': double free or corruption (!prev)

laklaja · February 20, 2018, 9:06am

Dear Colleagues,

I have the following issue at translation time. Roughly once in every 8-10 translations I get the following error.

$ export THC_CACHING_ALLOCATOR=0; th ./translate.lua -model .../model.t7 -src .../test.en -output .../test.ja -beam_size 5 -batch_size 150 -replace_unk -phrase_table enja.align.dic -max_sent_length 350 -length_norm 0.6 -coverage_norm 0 -eos_norm 0 -gpuid 2
*** Error in '<path>/torch/install/bin/lua': double free or corruption (!prev): 0x00000000013341b0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7c619)[0x7fb3ef434619]
/lib64/libc.so.6(+0x7e918)[0x7fb3ef436918]
/lib64/libc.so.6(realloc+0x1b2)[0x7fb3ef438752]
<path>/torch/install/bin/lua[0x415142]
<path>/torch/install/bin/lua[0x40e6e9]
<path>/torch/install/bin/lua[0x40ef12]
<path>/torch/install/bin/lua[0x41e4dd]
<path>/torch/install/bin/lua[0x40f458]
<path>/torch/install/bin/lua(lua_callk+0x3b)[0x40801b]
<path>/torch/install/bin/lua[0x42ad22]
<path>/torch/install/bin/lua[0x42ae30]

I get the same error on all of my machines. Can you help me figure out what I did wrong?

I have already created an issue in the GitHub issue list:

My environment:
OpenNMT v0.9.7
CUDA 9.1
cuDNN 7
CentOS 7
Torch master

Best regards,
László

guillaumekln · February 21, 2018, 7:31am

Hello,

I think you should try with a lower batch size. The default 30 is already a good balance for performance and requires less memory.

laklaja · February 21, 2018, 7:45am

Dear Guillaume,

Thanks for replying to my issue. I am working on a GTX 1080 Ti and this batch size uses at most 4GB of memory. The interesting thing is that if I retry the translation, it works fine. It will translate 4-5 times, then I get this error.

Might it be a Lua garbage collection issue (I am completely new to Lua, I just googled this question)?
Could it be a CUDA 9 issue? I tried the same environment on Ubuntu and it works fine for me, but I have the same issue on more than one CentOS 7 system.

László

jean.senellart · March 5, 2018, 9:22pm

Hello László, for reference, I can reproduce that behaviour on one of our servers - it appeared recently, and I wonder too if it could be connected to CUDA 9. We will try to narrow it down.
best
Jean

nvr-rug · March 8, 2018, 3:37pm

For what it’s worth, I have the same issue. It can occur during preprocessing, training or translating. It is not reproducible for me (i.e. if I rerun the exact same script after an error, it usually works). It would be nice if this was fixed: I often run full pipelines (preprocess, train, translate) for multiple parameter settings, and because of this error some of them fail sometimes.
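Until the cause is found, one way to keep a multi-step pipeline going is to rerun a step when it aborts. The following is only a sketch of that workaround, not a fix: `retry` is a hypothetical helper (not part of OpenNMT or Torch), and it assumes that rerunning the same command is safe, which matches the observation that a rerun usually works.

```shell
#!/usr/bin/env bash
# retry (hypothetical helper): run a command up to MAX_TRIES times,
# stopping at the first success and returning the last exit status otherwise.
# A glibc "double free or corruption" abort shows up as a non-zero status,
# so it is treated like any other failure.
retry() {
  local max_tries=$1
  local attempt
  local status=1
  shift
  for (( attempt = 1; attempt <= max_tries; attempt++ )); do
    if "$@"; then
      return 0
    else
      status=$?
    fi
    echo "attempt $attempt/$max_tries failed with status $status" >&2
  done
  return "$status"
}

# Example usage (arguments as in the original post, paths elided there too):
#   retry 3 th ./translate.lua -model .../model.t7 -src .../test.en -output .../test.ja
```

Note that this only hides the crash; it says nothing about why the memory corruption happens.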

laklaja · March 27, 2018, 8:38am

Thanks Jean!

As nvr-rug describes above, I have the same issue with preprocess.lua as well.

laklaja · April 10, 2018, 7:15pm

Hello,

I have the following situation and would like to ask whether it could be the cause of my issue. I am running multiple trainings on my server (2 trainings, one per GPU), and as far as I can see they have allocated all of my available virtual memory. Now I get this error every time I start a translation. Furthermore, I tried to run the CUDA samples and they gave me the same error message. What do you think, could this be my problem? I don’t want to stop my trainings now to test my theory, but as soon as I can I will post the result.

Thanks,
László

jean.senellart · April 12, 2018, 7:45pm

Hi László, sorry about the lack of response on that. On our side, we re-installed our server (CUDA libraries/driver), and the issue went away. Can you share the exact versions of your driver/libraries?
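For anyone else reporting this, a small script along these lines could gather the versions in one go. It is a sketch under assumptions: `collect_versions` is a made-up name, it relies on `nvidia-smi` and `nvcc` being on the `PATH`, and it does not cover cuDNN or the Torch commit, which have to be looked up separately.

```shell
#!/usr/bin/env bash
# collect_versions (hypothetical helper): print OS, GPU/driver and CUDA toolkit
# versions, one labelled line each. Missing tools are reported, not treated as errors.
collect_versions() {
  # OS release: CentOS keeps it in /etc/centos-release; /etc/os-release is the generic fallback.
  if [ -r /etc/centos-release ]; then
    echo "os: $(cat /etc/centos-release)"
  elif [ -r /etc/os-release ]; then
    echo "os: $(. /etc/os-release && echo "${PRETTY_NAME:-unknown}")"
  else
    echo "os: unknown"
  fi

  # GPU model and driver version of the first GPU, as reported by the driver itself.
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "gpu/driver: $(nvidia-smi --query-gpu=name,driver_version --format=csv,noheader 2>/dev/null | head -n 1)"
  else
    echo "gpu/driver: nvidia-smi not found"
  fi

  # CUDA toolkit version: the "release" line of the nvcc banner.
  if command -v nvcc >/dev/null 2>&1; then
    echo "cuda: $(nvcc --version | grep -i 'release' | head -n 1)"
  else
    echo "cuda: nvcc not found"
  fi
}

collect_versions
```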

best
Jean

laklaja · April 16, 2018, 6:13pm

Dear Jean,

Regarding my previous comment: as soon as one of the trainings finished, translation started working again.

Please tell me if I missed something:

CentOS Linux release 7.4.1708
GeForce GTX 1080 Ti
NVIDIA Driver version: 390.25
cuda_9.1.85
cudnn-9.1-linux-x64-v7.solitairetheme8
OpenNMT tag v0.9.7
torch: commit 20e5237 (18/oct/2017)

Thanks,
László