Could you please give some details about your file size and your machine specifications?
The last time I got such a “killed” message, it was due to exceeding the memory (RAM) limit. Sometimes rebooting helps; however, when the available memory is really limited compared to the required resources, the only solution is to increase it.
Note that you can check kernel termination errors in /var/log/kern.log or with the dmesg command.
Dear Yasmin
Thanks for the fast reply. The data is big: almost 500 million sentences. The computer has a GPU and 64 GB of memory. I checked the log, and it does indeed look like a memory problem:
[1030874.573209] Out of memory: Kill process 10589 (python) score 768 or sacrifice child
[1030874.573214] Killed process 10589 (python) total-vm:53576540kB, anon-rss:52157940kB, file-rss:32kB, shmem-rss:0kB
I’ve tried to run the preprocess.py script with -shard_size 1000000, and I got a segmentation fault error.
The default shard_size is already 1000000. Have you tried other values, like 2000000?
If this does not work, you can run an “Edit Distance” algorithm on the original files and exclude sentences with a very small Edit Distance (keeping only one of them). Such sentences will most likely differ only by a comma or something trivial. This can reduce your dataset size while keeping the necessary data.
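Just to illustrate the idea, here is a minimal Python sketch of that filtering step. It uses difflib’s SequenceMatcher ratio as a stand-in for a proper edit-distance library, sorts the corpus so near-identical sentences end up next to each other, and drops a pair when its source side is almost identical to the previously kept one. The file names and the 0.95 threshold are placeholders, and for a corpus of 500 million sentences you would want to run it shard by shard (or with an external sort) rather than loading everything into memory:

```python
import difflib

SIMILARITY_THRESHOLD = 0.95  # placeholder value; tune it on a sample of your data

def near_duplicate(a: str, b: str) -> bool:
    """True when two source sentences differ only by something trivial."""
    return difflib.SequenceMatcher(None, a, b).ratio() >= SIMILARITY_THRESHOLD

def filter_parallel(src_in, tgt_in, src_out, tgt_out):
    # Read the corpus as (source, target) pairs so the alignment is preserved,
    # then sort by source sentence so near-duplicates become adjacent.
    with open(src_in, encoding="utf-8") as fs, open(tgt_in, encoding="utf-8") as ft:
        pairs = sorted(zip((l.rstrip("\n") for l in fs),
                           (l.rstrip("\n") for l in ft)))

    kept = []
    for src, tgt in pairs:
        # After sorting, comparing against the last kept source is enough
        # to catch sentences that differ only by punctuation or a word.
        if kept and near_duplicate(kept[-1][0], src):
            continue
        kept.append((src, tgt))

    with open(src_out, "w", encoding="utf-8") as fs, \
         open(tgt_out, "w", encoding="utf-8") as ft:
        for src, tgt in kept:
            fs.write(src + "\n")
            ft.write(tgt + "\n")

if __name__ == "__main__":
    filter_parallel("train.src", "train.tgt", "train.dedup.src", "train.dedup.tgt")
```

Sorting keeps the comparison at O(n log n) instead of comparing all pairs, at the cost of only catching near-duplicates that sort next to each other; that is usually enough for sentences that differ by trivial edits.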
There are other approaches, such as processing and training on only part of your data and then applying retraining / incremental training with the new data while updating the vocabulary, but this is better done with the TensorFlow version, which has the vocabulary update option. The forum has several posts about this approach. Still, I recommend first trying to clean your data with the Edit Distance approach mentioned above.