OpenNMT Forum

Preprocessing strange error

Hi,

I’ve got the following error when trying to preprocess my data for training:

[COMMAND] onmt_preprocess -train_src train/corpus.train.src.tok -train_tgt train/corpus.train.tgt.tok -valid_src dev/corpus.dev.src.tok -valid_tgt dev/corpus.dev.tgt.tok -save_data preprocess/preprocessed --num_threads 2  --src_seq_length 150 --tgt_seq_length 150
[2020-09-04 15:56:51,394 INFO] Extracting features...
[2020-09-04 15:56:51,394 INFO]  * number of source features: 0.
[2020-09-04 15:56:51,394 INFO]  * number of target features: 0.
[2020-09-04 15:56:51,394 INFO] Building `Fields` object...
[2020-09-04 15:56:51,394 INFO] Building & saving training data...
Traceback (most recent call last):
  File "run/venv/bin/onmt_preprocess", line 11, in <module>
    load_entry_point('OpenNMT-py==1.1.1', 'console_scripts', 'onmt_preprocess')()
  File "run/venv/lib/python3.6/site-packages/OpenNMT_py-1.1.1-py3.6.egg/onmt/bin/preprocess.py", line 318, in main
    preprocess(opt)
  File "run/venv/lib/python3.6/site-packages/OpenNMT_py-1.1.1-py3.6.egg/onmt/bin/preprocess.py", line 298, in preprocess
    'train', fields, src_reader, tgt_reader, align_reader, opt)
  File "run/venv/lib/python3.6/site-packages/OpenNMT_py-1.1.1-py3.6.egg/onmt/bin/preprocess.py", line 205, in build_save_dataset
    for sub_counter in p.imap(func, shard_iter):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
**struct.error: 'i' format requires -2147483648 <= number <= 2147483647**
Traceback (most recent call last):
  File "../run/run.py", line 293, in <module>
    raise Exception("There was an error preprocessing data")    
Exception: There was an error preprocessing data

Any idea what’s happening here?

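Solved! It seems to be an issue when too many ⦅ characters are present in a sentence. Somehow those sentences end up too big to pickle: the traceback shows multiprocessing packing the payload length with struct.pack("!i", n), a signed 32-bit integer, so anything larger than 2147483647 bytes cannot be sent to a worker.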
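
In case it helps anyone hitting the same error, here is a minimal sketch of the two things involved. The first function reproduces the limit from the last frame of the traceback (`struct.pack("!i", n)`). The second is a hypothetical helper, not part of OpenNMT-py, that flags lines containing many ⦅ characters so they can be inspected or filtered before preprocessing. The function names and the default threshold of 100 are assumptions chosen for illustration.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647, the upper bound quoted in the error


def fits_in_int32_header(n_bytes: int) -> bool:
    """Return True if a payload length can be packed with the "!i" format."""
    try:
        struct.pack("!i", n_bytes)
        return True
    except struct.error:
        # Same exception type as in the traceback above.
        return False


def find_marker_heavy_lines(lines, marker="⦅", threshold=100):
    """Return (line_number, count) for lines with more than `threshold` markers.

    Hypothetical helper: `threshold` is an arbitrary illustrative value,
    not a limit defined by OpenNMT-py.
    """
    return [
        (number, line.count(marker))
        for number, line in enumerate(lines, start=1)
        if line.count(marker) > threshold
    ]
```

To check a corpus, open the file with `encoding="utf-8"` and pass the file object straight to `find_marker_heavy_lines`, since it only needs an iterable of lines. A sensible threshold depends on the data, so it is worth trying a few values.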