Killed error while running preprocess.py

Hello

I’m getting a Killed error while running the preprocess.py script. This is the error:

[2019-06-18 15:18:14,683 INFO] Extracting features...

[2019-06-18 15:18:14,775 INFO] * number of source features: 0.

[2019-06-18 15:18:14,775 INFO] * number of target features: 0.

[2019-06-18 15:18:14,775 INFO] Building `Fields` object...

[2019-06-18 15:18:14,776 INFO] Building & saving training data...

[2019-06-18 15:18:14,776 INFO] Reading source and target files: OpenNMT-py/ext_data/data_emb/amica/amica_to_annotate.xlsx.corpus_ext.train.ori OpenNMT-py/ext_data/data_emb/amica/amica_to_annotate.xlsx.corpus_ext.train.tgt

Killed

I also wonder why the number of source and target features is 0.

I hope you can help me.
Thanks in advance.

Claudia

Dear Claudia,

Could you please give some details about your file size and your machine specifications?

The last time I got such a “Killed” message, it was due to exceeding the memory (RAM) limit. Sometimes rebooting helps; however, when the available memory is really limited compared to the required resources, the only solution is to add more.

Note that you can check kernel termination errors in /var/log/kern.log or with the dmesg command.
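For example, here is a minimal Python sketch of that check (assuming dmesg is available on your system and that the OOM killer logged lines containing “Out of memory” or “Killed process”, which is the usual wording but can vary by kernel version):

import subprocess

# Minimal sketch: scan the kernel ring buffer (dmesg) for OOM-killer messages.
# On some systems reading the kernel log requires root privileges.
kernel_log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if "Out of memory" in line or "Killed process" in line:
        print(line)

If nothing is printed, the process was probably not killed by the OOM killer and the cause lies elsewhere.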

Kind regards,
Yasmin

Dear Yasmin
Thanks for the fast reply. The data is big: almost 500 million sentences. The computer has a GPU and 64 GB of RAM. I checked the log and it does indeed look like a memory problem.

[1030874.573209] Out of memory: Kill process 10589 (python) score 768 or sacrifice child

[1030874.573214] Killed process 10589 (python) total-vm:53576540kB, anon-rss:52157940kB, file-rss:32kB, shmem-rss:0kB

I’ve tried running the preprocess.py script with -shard_size 1000000 and I got a segmentation fault.

Regards
Claudia

Dear Claudia,

That is a really big dataset. Does it include duplicates?

Kind regards,
Yasmin

No, they are all different sentences.

Dear Claudia,

The default shard_size is 1000000. Have you tried other values, like 2000000?

If this does not work, you can run an “Edit Distance” algorithm on the original files and exclude sentences with a very small edit distance between them (keeping only one of each pair). Such sentences will most likely differ only by a comma or something equally trivial. This can reduce your dataset size while keeping the necessary data.
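For instance, here is a minimal sketch of that idea using only the Python standard library (difflib’s similarity ratio as a rough stand-in for a proper edit distance; the threshold and the tiny sample are just placeholders, and on 500 million sentences you would first bucket sentences by length or a normalized hash instead of comparing all pairs):

import difflib

def near_duplicate(a, b, threshold=0.9):
    # A similarity ratio close to 1.0 means the two sentences differ only by
    # a comma or something equally trivial.
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def filter_near_duplicates(sentences, threshold=0.9):
    # Keep only the first sentence of each group of near-duplicates.
    # This pairwise check is quadratic, so on a large corpus compare only
    # within buckets of similar sentences.
    kept = []
    for sentence in sentences:
        if not any(near_duplicate(sentence, k, threshold) for k in kept):
            kept.append(sentence)
    return kept

sample = [
    "The cat sat on the mat.",
    "The cat sat on the mat ,",
    "A completely different sentence.",
]
print(filter_near_duplicates(sample))
# ['The cat sat on the mat.', 'A completely different sentence.']

In a parallel corpus, remember to drop the corresponding target line whenever you drop a source line, so the two files stay aligned.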

There are other approaches, such as preprocessing and training on only part of your data and then applying retraining / incremental training with the new data while updating the vocabulary, but this is better done with the TensorFlow version, which has an update-vocabulary option. The forum has several posts about this approach. Still, I recommend first trying to clean your data with the Edit Distance approach mentioned above.

Kind regards,
Yasmin

With shard_size 1000000 you should not get a segfault.
Are you using Python 2.7 or 3?

Ok, I’ll check that. Thanks!!

I’m using Python 3.

Hello. Do you know if the update_vocabulary option is available in the Lua version or in the OpenNMT-py version?

Dear Claudia,

-update_vocab is in the TensorFlow and Lua versions, but not in the PyTorch version of OpenNMT.

Have you managed to solve the big file issue? Have you tried the Edit Distance suggestion?

Kind regards,
Yasmin
