Error in `<path>/torch/install/bin/lua': double free or corruption (!prev)

(László Laki) #1

Dear Colleagues,

I have the following issue at translation time. Roughly every 8-10 translations, I get the following error.

$ export THC_CACHING_ALLOCATOR=0; th ./translate.lua -model .../model.t7 -src .../test.en -output .../test.ja -beam_size 5 -batch_size 150 -replace_unk -phrase_table enja.align.dic -max_sent_length 350 -length_norm 0.6 -coverage_norm 0 -eos_norm 0 -gpuid 2

*** Error in `<path>/torch/install/bin/lua': double free or corruption (!prev): 0x00000000013341b0 ***
======= Backtrace: =========
/lib64/[0x7fb3ef434619]
/lib64/[0x7fb3ef436918]
/lib64/[0x7fb3ef438752]
<path>/torch/install/bin/lua[0x415142]
<path>/torch/install/bin/lua[0x40e6e9]
<path>/torch/install/bin/lua[0x40ef12]
<path>/torch/install/bin/lua[0x41e4dd]
<path>/torch/install/bin/lua[0x40f458]
<path>/torch/install/bin/lua(lua_callk+0x3b)[0x40801b]
<path>/torch/install/bin/lua[0x42ad22]
<path>/torch/install/bin/lua[0x42ae30]

I get the same error on all of my machines. Can you help me figure out what I did wrong?

I have already created an issue in the GitHub issue tracker.

My environment:
OpenNMT v0.9.7
CUDA 9.1
Torch master

Best regards,

(Guillaume Klein) #2


I think you should try a lower batch size. The default of 30 is already a good balance for performance and requires less memory.
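For example, the same command as in the first post with only -batch_size lowered to the default (paths abbreviated exactly as above, everything else unchanged):

$ export THC_CACHING_ALLOCATOR=0; th ./translate.lua -model .../model.t7 -src .../test.en -output .../test.ja -beam_size 5 -batch_size 30 -replace_unk -phrase_table enja.align.dic -max_sent_length 350 -length_norm 0.6 -coverage_norm 0 -eos_norm 0 -gpuid 2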

(László Laki) #3

Dear Guillaume,

Thanks for replying to my issue. I am working on a GTX 1080 Ti and this batch size uses at most 4 GB of memory. The interesting thing is that if I retry the translation, it works fine. It will translate 4-5 times, then I get this error again.

  • Might it be a Lua garbage collection issue? (I am absolutely new to Lua; I just found this question by googling.)
  • Could it be a CUDA 9 issue? I tried the same environment on Ubuntu and it works fine for me, but I have the same issue on more than one CentOS 7 system.


(jean.senellart) #4

Hello László, for reference, I can reproduce that behaviour on one of our servers - it appeared recently and we also wonder whether it could be connected to CUDA 9. We will try to narrow it down.

(R) #5

For what it’s worth, I have the same issue. It can occur during preprocessing, training, or translation. It is not reproducible for me (i.e. if I rerun the exact same script after an error, it usually works). It would be nice if this were fixed; I often run full pipelines (preprocess, train, translate) for multiple parameter settings, and some of them fail occasionally because of this.
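In the meantime, a crude workaround is to retry a step automatically when it crashes. An untested sketch (the translate command is just the one from the first post, abbreviated; the retry count of 5 is arbitrary):

# retry the step up to 5 times if it exits non-zero;
# replace the th command with whichever step (preprocess/train/translate) crashed
for i in 1 2 3 4 5; do
    th ./translate.lua -model .../model.t7 -src .../test.en -output .../test.ja && break
    echo "attempt $i failed, retrying..." >&2
done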