Killed error while running preprocess.py

Hello

I’m getting a Killed error while running the preprocess.py script. This is the error:

[2019-06-18 15:18:14,683 INFO] Extracting features...

[2019-06-18 15:18:14,775 INFO] * number of source features: 0.

[2019-06-18 15:18:14,775 INFO] * number of target features: 0.

[2019-06-18 15:18:14,775 INFO] Building `Fields` object...

[2019-06-18 15:18:14,776 INFO] Building & saving training data...

[2019-06-18 15:18:14,776 INFO] Reading source and target files: OpenNMT-py/ext_data/data_emb/amica/amica_to_annotate.xlsx.corpus_ext.train.ori OpenNMT-py/ext_data/data_emb/amica/amica_to_annotate.xlsx.corpus_ext.train.tgt

Killed

I also wonder why the number of source and target features is 0.

I hope you can help me.
Thanks in advance.

Claudia

Dear Claudia,

Could you please give some details about your file size and your machine specifications?

The last time I got such a “Killed” message, it was due to exceeding the memory (RAM) limit. Sometimes rebooting helps; however, when the available memory is really limited compared to the required resources, the only solution is to add more.

Note that you can check kernel termination errors in /var/log/kern.log or with the dmesg command.
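For example, here is a minimal Python sketch of that check (assuming dmesg is available on your system and that the OOM killer logged lines containing “Out of memory” or “Killed process”, which is the usual wording but can vary by kernel version):

import subprocess

# Minimal sketch: scan the kernel ring buffer (dmesg) for OOM-killer messages.
# On some systems reading the kernel log requires root privileges.
kernel_log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if "Out of memory" in line or "Killed process" in line:
        print(line)

If nothing is printed, the process was probably not killed by the OOM killer and the cause lies elsewhere.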

Kind regards,
Yasmin

Dear Yasmin
Thanks for the fast reply. The data is big: almost 500 million sentences. The computer has a GPU and 64 GB of RAM. I checked the log and it does indeed look like a memory problem.

[1030874.573209] Out of memory: Kill process 10589 (python) score 768 or sacrifice child

[1030874.573214] Killed process 10589 (python) total-vm:53576540kB, anon-rss:52157940kB, file-rss:32kB, shmem-rss:0kB

I’ve tried running the preprocess.py script with -shard_size 1000000 and I got a segmentation fault.

Regards
Claudia

Dear Claudia,

That is a really big dataset. Does it include duplicates?

Kind regards,
Yasmin

No, they are all different sentences.

Dear Claudia,

The default shard_size is 1000000. Have you tried other values, like 2000000?

If this does not work, you can run an “Edit Distance” algorithm on the original files and exclude sentences with a very small edit distance between them (keeping only one of each pair). Such sentences will most likely differ only by a comma or something equally trivial. This can reduce your dataset size while keeping the necessary data.
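For instance, here is a minimal sketch of that idea using only the Python standard library (difflib’s similarity ratio as a rough stand-in for a proper edit distance; the threshold and the tiny sample are just placeholders, and on 500 million sentences you would first bucket sentences by length or a normalized hash instead of comparing all pairs):

import difflib

def near_duplicate(a, b, threshold=0.9):
    # A similarity ratio close to 1.0 means the two sentences differ only by
    # a comma or something equally trivial.
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

def filter_near_duplicates(sentences, threshold=0.9):
    # Keep only the first sentence of each group of near-duplicates.
    # This pairwise check is quadratic, so on a large corpus compare only
    # within buckets of similar sentences.
    kept = []
    for sentence in sentences:
        if not any(near_duplicate(sentence, k, threshold) for k in kept):
            kept.append(sentence)
    return kept

sample = [
    "The cat sat on the mat.",
    "The cat sat on the mat ,",
    "A completely different sentence.",
]
print(filter_near_duplicates(sample))
# ['The cat sat on the mat.', 'A completely different sentence.']

In a parallel corpus, remember to drop the corresponding target line whenever you drop a source line, so the two files stay aligned.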

There are other approaches, such as preprocessing and training on only part of your data and then applying retraining / incremental training with the new data while updating the vocabulary, but this is better done with the TensorFlow version, which has an update-vocabulary option. The forum has several posts about this approach. Still, I recommend first trying to clean your data with the Edit Distance approach mentioned above.

Kind regards,
Yasmin

With shard_size 1000000 you should not get a segfault.
Are you using Python 2.7 or 3?

Ok, I’ll check that. Thanks!!

I’m using Python 3.

Hello. Do you know if the update_vocabulary option is available in the Lua version or in the OpenNMT-py version?

Dear Claudia,

-update_vocab is in the TensorFlow and Lua versions, but not in the PyTorch version of OpenNMT.

Have you managed to solve the big file issue? Have you tried the Edit Distance suggestion?

Kind regards,
Yasmin
