Out Of Memory with Dynamic Dataset, LuaJIT, preprocess_nthreads > 1


(Etienne Monneret) #1

I’m on a machine with 64G of RAM.

First, another process was using 16G of RAM, and I got a kind of OutOfMemory error in the “Preparing” step, while preparing the largest file.

I killed the other process. Now the train process seems to be idle in the “Preparing” step, on the largest file: no CPU usage, very low RAM usage, no I/O, no more text sent to the log, but still running.

[09/11/17 15:33:27 INFO] Parsing train data from directory '/home/lm-dev8/mmt_2017-07-05/DATA/train_FREN':	
[09/11/17 15:33:27 INFO]  * [2] Reading files 'OpenOffice.fr-en.' - 31902 sentences	
[09/11/17 15:33:28 INFO]  * [4] Reading files 'LM_TRANSPORT.fr-en.' - 211199 sentences	
[09/11/17 15:33:28 INFO]  * [4] Reading files 'LM_COMPUTING.fr-en.' - 193617 sentences	
[09/11/17 15:33:28 INFO]  * [4] Reading files 'KDE4.fr-en.' - 180709 sentences	
[09/11/17 15:33:28 INFO]  * [4] Reading files 'LM_COOKING.fr-en.' - 74148 sentences	
[09/11/17 15:33:28 INFO]  * [4] Reading files 'Gnome.fr-en.' - 55391 sentences	
[09/11/17 15:33:28 INFO]  * [4] Reading files 'LM_MANAGEMENT.fr-en.' - 103575 sentences	
[09/11/17 15:33:30 INFO]  * [2] Reading files 'Wikipedia.fr-en.' - 803670 sentences	
[09/11/17 15:33:30 INFO]  * [2] Reading files 'Ubuntu.fr-en.' - 9314 sentences	
[09/11/17 15:33:30 INFO]  * [2] Reading files 'EMEA.fr-en.' - 373152 sentences	
[09/11/17 15:33:30 INFO]  * [2] Reading files 'PHP.fr-en.' - 16020 sentences	
[09/11/17 15:33:31 INFO]  * [2] Reading files 'ECB.fr-en.' - 195949 sentences	
[09/11/17 15:33:33 INFO]  * [1] Reading files 'DGT.fr-en.' - 1987655 sentences	
[09/11/17 15:33:34 INFO]  * [4] Reading files 'europarl.fr-en.' - 2007723 sentences	
[09/11/17 15:34:02 INFO]  * [3] Reading files 'MultiUN.fr-en.' - 10480212 sentences	
[09/11/17 15:34:02 INFO] 16724236 sentences, in 15 files, in train directory	
...
[09/11/17 15:34:04 INFO] --- Preparing train sample
[09/11/17 15:34:23 INFO]  * [4] file 'OpenOffice.fr-en.': 31902 total, 3816 drawn, 3779 kept - unknown words: source = 31.1%, target = 20.6%
[09/11/17 15:34:24 INFO]  * [4] file 'LM_TRANSPORT.fr-en.': 211199 total, 25257 drawn, 25161 kept - unknown words: source = 18.7%, target = 10.1%
[09/11/17 15:34:28 INFO]  * [1] file 'LM_COMPUTING.fr-en.': 193617 total, 23155 drawn, 22086 kept - unknown words: source = 34.9%, target = 27.3%
[09/11/17 15:34:28 INFO]  * [3] file 'KDE4.fr-en.': 180709 total, 21611 drawn, 20638 kept - unknown words: source = 45.5%, target = 30.8%
[09/11/17 15:34:29 INFO]  * [2] file 'LM_COOKING.fr-en.': 74148 total, 8868 drawn, 8699 kept - unknown words: source = 43.5%, target = 22.4%
[09/11/17 15:34:31 INFO]  * [2] file 'Ubuntu.fr-en.': 9314 total, 1114 drawn, 1108 kept - unknown words: source = 50.9%, target = 30.6%
[09/11/17 15:34:33 INFO]  * [1] file 'LM_MANAGEMENT.fr-en.': 103575 total, 12387 drawn, 11582 kept - unknown words: source = 17.7%, target = 15.5%
[09/11/17 15:34:33 INFO]  * [4] file 'Gnome.fr-en.': 55391 total, 6625 drawn, 6360 kept - unknown words: source = 38.7%, target = 28.0%
[09/11/17 15:34:33 INFO]  * [1] file 'PHP.fr-en.': 16020 total, 1916 drawn, 1815 kept - unknown words: source = 28.4%, target = 18.7%
[09/11/17 15:34:34 INFO]  * [2] file 'EMEA.fr-en.': 373152 total, 44625 drawn, 42271 kept - unknown words: source = 32.9%, target = 23.2%
[09/11/17 15:34:36 INFO]  * [4] file 'ECB.fr-en.': 195949 total, 23433 drawn, 18530 kept - unknown words: source = 25.0%, target = 19.9%
[09/11/17 15:34:42 INFO]  * [3] file 'Wikipedia.fr-en.': 803670 total, 96109 drawn, 90160 kept - unknown words: source = 26.8%, target = 24.2%
[09/11/17 15:35:28 INFO]  * [1] file 'DGT.fr-en.': 1987655 total, 237698 drawn, 202591 kept - unknown words: source = 20.9%, target = 18.9%
[09/11/17 15:35:36 INFO]  * [2] file 'europarl.fr-en.': 2007723 total, 240098 drawn, 209622 kept - unknown words: source = 6.4%, target = 20.5%

It’s now 16:25… still no new log line…

:neutral_face:


Dynamic Dataset
(jean.senellart) #2

I observed such a freeze once while I was testing but could not reproduce it - it looks like a thread-racing issue. Can you stop and relaunch? Do you observe it systematically?


(Etienne Monneret) #3

Yes. After several attempts, as said, I first got a few OutOfMemory errors, then a few freezes…


(Etienne Monneret) #4

Very strange! I pasted the command line manually into the terminal (rather than using a “.sh” file). I don’t think this makes a difference, but now the OutOfMemory error is back… even though the RAM usage was always very LOW!

[09/11/17 19:03:16 INFO]  * [3] file 'DGT.fr-en.': 1987655 total, 191344 drawn, 163277 kept - unknown words: source = 20.9%, target = 18.9%	
[09/11/17 19:03:28 INFO]  * [4] file 'europarl.fr-en.': 2007723 total, 193275 drawn, 168477 kept - unknown words: source = 6.4%, target = 20.5%	
FATAL THREAD PANIC: (addjob) not enough memory

This sounds like a process using a pre-defined amount of memory, too low for this large file size… or limited 32-bit processing?


(jean.senellart) #5

Are you using LuaJIT or Lua?


(jean.senellart) #7

That might be the difference: this is the expected 2 GB memory limitation of LuaJIT, and it might be reached because of parallel processing. Can you try reducing the number of parallel processes: -preprocess_pthreads 1?
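For context, LuaJIT caps its garbage-collected heap well below available system RAM, regardless of how much memory the machine has. A minimal Lua sketch (not from the thread, for illustration only) of what each preprocessing thread can run into:

```lua
-- Sketch: keep allocating unique 1 MB strings until LuaJIT's GC heap
-- is exhausted. On stock LuaJIT x64 this fails around the 1-2 GB mark
-- even on a 64 GB machine; plain Lua 5.1 can go much further.
local i, chunks = 0, {}
local ok, err = pcall(function()
  while true do
    i = i + 1
    -- appending i makes each string unique, so interning can't dedupe them
    chunks[i] = string.rep("x", 2^20) .. i
  end
end)
print(ok, err)  -- expected: false plus a "not enough memory" style error
```

With several preprocessing threads each building large Lua tables, one of them hitting this per-state cap is enough to trigger the `FATAL THREAD PANIC: ... not enough memory` seen above.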


(Etienne Monneret) #8

It works !
:grinning:
The question now is: what are the consequences for this step with even larger files, and for the whole process in general?


(jean.senellart) #9

Using -preprocess_pthreads 1 just disables the parallel-processing speed-up, which will make sampling on very large datasets far slower - I will check how I can reduce that limitation for LuaJIT.


(Vincent Nguyen) #10

@jean.senellart
Do you recall that someone used tds.Vec in the BPE Lua code to circumvent the memory issue with LuaJIT?
Maybe indexing your segments the same way could be the solution; it was much faster too.
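The tds package (torch/tds) provides C-backed containers whose storage lives outside LuaJIT's garbage-collected heap, so large datasets don't count against the ~2 GB cap. A minimal sketch (assumes the tds rock is installed; not code from the patch):

```lua
local tds = require 'tds'

-- tds.Vec keeps its elements in C memory (malloc), outside LuaJIT's
-- GC-managed heap, so millions of entries don't hit the per-state limit.
local sentences = tds.Vec()
sentences:insert('le chat dort .')
sentences:insert('the cat sleeps .')
print(#sentences)    -- 2
print(sentences[1])  -- le chat dort .
```

The trade-off is that tds containers hold C data rather than arbitrary Lua values by reference, which is exactly why they sidestep the LuaJIT heap limit.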


(jean.senellart) #11

(jean.senellart) #12

@Etienne38 - the current patch is here:

It seems to work for us, but the issue is a bit random and we can’t always reproduce it - can you please check on your side?


(Etienne Monneret) #13

Got this error:

[10/06/17 15:33:55 INFO] --- Preparing valid sample	
/home/lm-dev8/torch/install/bin/luajit: .../lm-dev8/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] ./onmt/data/Preprocessor.lua:732: attempt to index field 'logger' (a nil value)
stack traceback:
	./onmt/data/Preprocessor.lua:732: in function <./onmt/data/Preprocessor.lua:599>
	[C]: in function 'xpcall'
	.../lm-dev8/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
	/home/lm-dev8/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/lm-dev8/torch/install/share/lua/5.1/threads/queue.lua:41>
	[C]: in function 'pcall'
	/home/lm-dev8/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
	[string "  local Queue = require 'threads.queue'..."]:15: in main chunk
stack traceback:
	[C]: in function 'error'
	.../lm-dev8/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
	.../lm-dev8/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
	./onmt/data/Preprocessor.lua:368: in function 'poolSynchronize'
	./onmt/data/Preprocessor.lua:791: in function 'makeGenericData'
	./onmt/data/Preprocessor.lua:873: in function 'makeBilingualData'
	./onmt/data/Preprocessor.lua:1038: in function 'makeData'
	./onmt/data/DynamicDataRepository.lua:22: in function 'getValid'
	train.lua:70: in function 'buildDataset'
	train.lua:174: in function 'main'
	train.lua:199: in main chunk
	[C]: in function 'dofile'
	...dev8/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x00405d50

I’m not sure of my actual code state. Would it be possible to clone fresh code, with all the latest devs?


(Etienne Monneret) #14

With -preprocess_pthreads 1 it is still working properly.

But now all prepared sentences are written to the log… a bit heavy!

:hushed:


(jean.senellart) #15

Thanks - can you check again? The complete log output is triggered by log level DEBUG. Is that what you are using?
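If the verbose output does come from the log level, it should be enough to keep it at the default. A sketch of the invocation (assuming the `-log_level` option of OpenNMT-lua's `train.lua`; keep your other flags as they are):

```shell
# Sketch only: -log_level INFO keeps per-sentence preprocessing output
# out of the log; DEBUG prints everything.
th train.lua ... -log_level INFO
```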


(Etienne Monneret) #16

I didn’t get the latest error, but the memory error still seems to be there:
FATAL THREAD PANIC: (dojob) not enough memory

Yes, but I didn’t get all the sentences in the log like this before the latest update.