How can I speed up corpus preprocessing?

steven · September 26, 2018, 9:19am

Dear all,

As suggested in the issue, I am following this tutorial to train on wmt15.

When I preprocess the corpus with preprocess_pthreads set to 10, it does not seem to get any faster, contrary to what I expected.
How can I speed up this step?
Thank you for your help.

th preprocess.lua -train_src ../wmt15-de-en/wmt15-all-de-en.en.tok -train_tgt ../wmt15-de-en/wmt15-all-de-en.de.tok -valid_src ../wmt15-de-en/newstest2013.en.tok -valid_tgt ../wmt15-de-en/newstest2013.de.tok -save_data ../wmt15-de-en/wmt15-all-en-de -preprocess_pthreads 10

P.S. My CPU has 40 threads.

guillaumekln · September 26, 2018, 2:57pm

Hi,

If I remember correctly, preprocessing does not benefit much from multithreading. What is the current execution time of the preprocessing? It is usually small compared to the training time.

An alternative would be to use the “dynamic dataset” feature, which removes the preprocessing step and prepares the data on the fly during training.

steven · September 28, 2018, 6:37am

Dear sir,

Thank you for your response.
I am following the official reference, which documents a parameter, preprocess_pthreads, that sets the number of threads:

-preprocess_pthreads <number> (default: 4 )
Number of parallel threads for preprocessing.

Does that mean that the larger this value is, the faster the preprocessing will be?

I added timers to preprocess.lua, and the execution times are as follows.
Without setting preprocess_pthreads (default value):

root@99c09856b3c2:/wmt15-ende/OpenNMT# th preprocess.lua -train_src ../wmt15-de-en/wmt15-all-de-en.en.tok -train_tgt ../wmt15-de-en/wmt15-all-de-en.de.tok -valid_src ../wmt15-de-en/newstest2013.en.tok -valid_tgt ../wmt15-de-en/newstest2013.de.tok -save_data ../wmt15-de-en/wmt15-all-en-de
[09/28/18 02:37:35 INFO] Using on-the-fly 'space' tokenization for input 1	
[09/28/18 02:37:35 INFO] Using on-the-fly 'space' tokenization for input 2	
[09/28/18 02:37:35 INFO] Preparing vocabulary...	
[09/28/18 02:37:35 INFO]  * Building source vocabularies...	
[09/28/18 02:43:39 INFO]  * Created word dictionary of size 50004 (pruned from 882957)	
[09/28/18 02:43:39 INFO] 	
[09/28/18 02:43:39 INFO]  * Building target vocabularies...	
[09/28/18 02:49:47 INFO]  * Created word dictionary of size 50004 (pruned from 1851345)	
[09/28/18 02:49:47 INFO] 	
Preparing vocabulary time: 731.43809604645 seconds	
[09/28/18 02:49:47 INFO] Preparing training data...	
[09/28/18 02:49:47 INFO] --- Preparing train sample	
[09/28/18 03:12:21 INFO]  * [-] file '../wmt15-de-en/wmt15-all-de-en.en.tok' (): 4535522 total, 4535522 drawn, 4144042 kept - unknown words: source = 2.8%, target = 6.1%	
[09/28/18 03:12:21 INFO] ... shuffling sentences	
[09/28/18 03:15:09 INFO] ... sorting sentences by size	
[09/28/18 03:17:41 INFO] Prepared 4144042 sentences:	
[09/28/18 03:17:41 INFO]  * 391480 sequences not validated (length, other)	
[09/28/18 03:17:41 INFO]  * average sequence length: source = 22.9, target = 21.8	
[09/28/18 03:17:41 INFO]  * source sentence length (range of 10): [ 7% ; 32% ; 28% ; 16% ; 8% ; 3% ; 1% ; 0% ; 0% ; 0% ]	
[09/28/18 03:17:41 INFO]  * target sentence length (range of 10): [ 8% ; 35% ; 27% ; 15% ; 7% ; 3% ; 1% ; 0% ; 0% ; 0% ]	
[09/28/18 03:17:41 INFO] 	
Preparing training data time: 1674.2263829708 seconds	
[09/28/18 03:17:41 INFO] 	
[09/28/18 03:17:41 INFO] Preparing validation data...	
[09/28/18 03:17:41 INFO] --- Preparing valid sample	
[09/28/18 03:17:42 INFO]  * [-] file '../wmt15-de-en/newstest2013.en.tok' (): 3000 total, 3000 drawn, 2891 kept - unknown words: source = 3.3%, target = 6.7%	
[09/28/18 03:17:42 INFO] ... shuffling sentences	
[09/28/18 03:17:42 INFO] ... sorting sentences by size	
[09/28/18 03:17:43 INFO] Prepared 2891 sentences:	
[09/28/18 03:17:43 INFO]  * 109 sequences not validated (length, other)	
[09/28/18 03:17:43 INFO]  * average sequence length: source = 20.4, target = 19.8	
[09/28/18 03:17:43 INFO]  * source sentence length (range of 10): [ 13% ; 36% ; 27% ; 13% ; 5% ; 2% ; 0% ; 0% ; 0% ; 0% ]	
[09/28/18 03:17:43 INFO]  * target sentence length (range of 10): [ 15% ; 37% ; 25% ; 13% ; 5% ; 1% ; 0% ; 0% ; 0% ; 0% ]	
[09/28/18 03:17:43 INFO] 	
Preparing validation data time: 1.3987720012665 seconds	
[09/28/18 03:17:43 INFO] 	
[09/28/18 03:17:43 INFO] Saving source vocabulary to '../wmt15-de-en/wmt15-all-en-de.src.dict'...	
[09/28/18 03:17:43 INFO] Saving target vocabulary to '../wmt15-de-en/wmt15-all-en-de.tgt.dict'...	
[09/28/18 03:17:43 INFO] Saving data to '../wmt15-de-en/wmt15-all-en-de-train.t7'...

With preprocess_pthreads set to 10:

root@99c09856b3c2:/wmt15-ende/OpenNMT# th preprocess.lua -train_src ../wmt15-de-en/wmt15-all-de-en.en.tok -train_tgt ../wmt15-de-en/wmt15-all-de-en.de.tok -valid_src ../wmt15-de-en/newstest2013.en.tok -valid_tgt ../wmt15-de-en/newstest2013.de.tok -save_data ../wmt15-de-en/wmt15-all-en-de -preprocess_pthreads 10
[09/28/18 04:31:59 INFO] Using on-the-fly 'space' tokenization for input 1	
[09/28/18 04:31:59 INFO] Using on-the-fly 'space' tokenization for input 2	
[09/28/18 04:31:59 INFO] Preparing vocabulary...	
[09/28/18 04:31:59 INFO]  * Building source vocabularies...	
[09/28/18 04:38:05 INFO]  * Created word dictionary of size 50004 (pruned from 882957)	
[09/28/18 04:38:05 INFO] 	
[09/28/18 04:38:05 INFO]  * Building target vocabularies...	
[09/28/18 04:44:31 INFO]  * Created word dictionary of size 50004 (pruned from 1851345)	
[09/28/18 04:44:31 INFO] 	
Preparing vocabulary time: 751.86415696144 seconds	
[09/28/18 04:44:31 INFO] Preparing training data...	
[09/28/18 04:44:31 INFO] --- Preparing train sample	
[09/28/18 05:06:12 INFO]  * [-] file '../wmt15-de-en/wmt15-all-de-en.en.tok' (): 4535522 total, 4535522 drawn, 4144042 kept - unknown words: source = 2.8%, target = 6.1%	
[09/28/18 05:06:12 INFO] ... shuffling sentences	
[09/28/18 05:08:43 INFO] ... sorting sentences by size	
[09/28/18 05:11:05 INFO] Prepared 4144042 sentences:	
[09/28/18 05:11:05 INFO]  * 391480 sequences not validated (length, other)	
[09/28/18 05:11:05 INFO]  * average sequence length: source = 22.9, target = 21.8	
[09/28/18 05:11:05 INFO]  * source sentence length (range of 10): [ 7% ; 32% ; 28% ; 16% ; 8% ; 3% ; 1% ; 0% ; 0% ; 0% ]	
[09/28/18 05:11:05 INFO]  * target sentence length (range of 10): [ 8% ; 35% ; 27% ; 15% ; 7% ; 3% ; 1% ; 0% ; 0% ; 0% ]	
[09/28/18 05:11:05 INFO] 	
Preparing training data time: 1593.7116341591 seconds	
[09/28/18 05:11:05 INFO] 	
[09/28/18 05:11:05 INFO] Preparing validation data...	
[09/28/18 05:11:05 INFO] --- Preparing valid sample	
[09/28/18 05:11:06 INFO]  * [-] file '../wmt15-de-en/newstest2013.en.tok' (): 3000 total, 3000 drawn, 2891 kept - unknown words: source = 3.3%, target = 6.7%	
[09/28/18 05:11:06 INFO] ... shuffling sentences	
[09/28/18 05:11:06 INFO] ... sorting sentences by size	
[09/28/18 05:11:06 INFO] Prepared 2891 sentences:	
[09/28/18 05:11:06 INFO]  * 109 sequences not validated (length, other)	
[09/28/18 05:11:06 INFO]  * average sequence length: source = 20.4, target = 19.8	
[09/28/18 05:11:06 INFO]  * source sentence length (range of 10): [ 13% ; 36% ; 27% ; 13% ; 5% ; 2% ; 0% ; 0% ; 0% ; 0% ]	
[09/28/18 05:11:06 INFO]  * target sentence length (range of 10): [ 15% ; 37% ; 25% ; 13% ; 5% ; 1% ; 0% ; 0% ; 0% ; 0% ]	
[09/28/18 05:11:06 INFO] 	
Preparing validation data time: 1.5299642086029 seconds	
[09/28/18 05:11:06 INFO] 	
[09/28/18 05:11:06 INFO] Saving source vocabulary to '../wmt15-de-en/wmt15-all-en-de.src.dict'...	
[09/28/18 05:11:07 INFO] Saving target vocabulary to '../wmt15-de-en/wmt15-all-en-de.tgt.dict'...	
[09/28/18 05:11:07 INFO] Saving data to '../wmt15-de-en/wmt15-all-en-de-train.t7'...

Thank you for your help.
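
To make the two runs easier to compare, here is a minimal Python sketch that computes the ratio between them for each step. The timings are copied from the timer lines in the logs above. The dictionary and function names are only illustrative and are not part of OpenNMT. Keep in mind that each configuration was measured once, so differences of a few percent may just be noise.

```python
# Timings in seconds, copied from the custom timer output in the two logs above.
default_run = {
    "vocabulary": 731.43809604645,
    "training data": 1674.2263829708,
    "validation data": 1.3987720012665,
}
threads_10_run = {
    "vocabulary": 751.86415696144,
    "training data": 1593.7116341591,
    "validation data": 1.5299642086029,
}


def speedups(baseline, candidate):
    """Return baseline/candidate time per step; a value above 1 means candidate is faster."""
    return {step: baseline[step] / candidate[step] for step in baseline}


ratios = speedups(default_run, threads_10_run)
total_ratio = sum(default_run.values()) / sum(threads_10_run.values())

for step, ratio in ratios.items():
    print(f"{step:>16}: {default_run[step]:8.1f}s -> {threads_10_run[step]:8.1f}s  (x{ratio:.3f})")
print(f"{'total':>16}: x{total_ratio:.3f}")
```

With these numbers, the training data step is about 5% faster with 10 threads, the vocabulary step is about 3% slower, and the total changes by under 3%.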

guillaumekln · September 28, 2018, 7:41am

Please note that the preprocessing is memory bound, not compute bound, so there will be little gain from using more threads unless you have very fast storage.
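
One rough way to put a number on "little gain" is to fit Amdahl's law to the two "Preparing training data" timings posted above. This is only a sketch under stated assumptions. It assumes the first run used the documented default of 4 threads, that the simple model T(n) = T1 * ((1 - p) + p / n) applies to this step, and that a single measurement per setting is representative. None of this comes from OpenNMT itself, and the function names are hypothetical.

```python
def parallel_fraction(time_a, threads_a, time_b, threads_b):
    """Estimate Amdahl's parallel fraction p from two timings.

    Model (an assumption): T(n) = T1 * ((1 - p) + p / n).
    Dividing T(threads_a) by T(threads_b) cancels T1 and leaves one equation in p.
    """
    r = time_a / time_b
    a = 1.0 - 1.0 / threads_a
    b = 1.0 - 1.0 / threads_b
    return (r - 1.0) / (r * b - a)


def projected_time(time_a, threads_a, p, threads_new):
    """Project the step time at threads_new using the same model."""
    t1 = time_a / ((1.0 - p) + p / threads_a)
    return t1 * ((1.0 - p) + p / threads_new)


# "Preparing training data" timings from the logs above.
# Assumption: the run without the option used the default of 4 threads.
t_default, t_ten = 1674.2263829708, 1593.7116341591

p = parallel_fraction(t_default, 4, t_ten, 10)
t_forty = projected_time(t_default, 4, p, 40)

print(f"estimated parallel fraction: {p:.2f}")
print(f"projected time with 40 threads: {t_forty:.0f}s")
```

Under these assumptions only about a quarter of that step scales with the thread count, so even using all 40 hardware threads would save well under 10% compared with the default. The vocabulary step got slightly slower with 10 threads, so the model does not describe it at all.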