I got the following error when trying to preprocess my data for training:

[COMMAND] onmt_preprocess -train_src train/corpus.train.src.tok -train_tgt train/corpus.train.tgt.tok -valid_src dev/ -valid_tgt dev/ -save_data preprocess/preprocessed --num_threads 2  --src_seq_length 150 --tgt_seq_length 150
[2020-09-04 15:56:51,394 INFO] Extracting features...
[2020-09-04 15:56:51,394 INFO]  * number of source features: 0.
[2020-09-04 15:56:51,394 INFO]  * number of target features: 0.
[2020-09-04 15:56:51,394 INFO] Building `Fields` object...
[2020-09-04 15:56:51,394 INFO] Building & saving training data...
Traceback (most recent call last):
  File "run/venv/bin/onmt_preprocess", line 11, in <module>
    load_entry_point('OpenNMT-py==1.1.1', 'console_scripts', 'onmt_preprocess')()
  File "run/venv/lib/python3.6/site-packages/OpenNMT_py-1.1.1-py3.6.egg/onmt/bin/", line 318, in main
  File "run/venv/lib/python3.6/site-packages/OpenNMT_py-1.1.1-py3.6.egg/onmt/bin/", line 298, in preprocess
    'train', fields, src_reader, tgt_reader, align_reader, opt)
  File "run/venv/lib/python3.6/site-packages/OpenNMT_py-1.1.1-py3.6.egg/onmt/bin/", line 205, in build_save_dataset
    for sub_counter in p.imap(func, shard_iter):
  File "/usr/lib/python3.6/multiprocessing/", line 735, in next
    raise value
  File "/usr/lib/python3.6/multiprocessing/", line 424, in _handle_tasks
  File "/usr/lib/python3.6/multiprocessing/", line 206, in send
  File "/usr/lib/python3.6/multiprocessing/", line 393, in _send_bytes
    header = struct.pack("!i", n)
**struct.error: 'i' format requires -2147483648 <= number <= 2147483647**
Traceback (most recent call last):
  File "../run/", line 293, in <module>
    raise Exception("There was an error preprocessing data")    
Exception: There was an error preprocessing data

Any clue of what’s happening here?

Solved! It seems to be an issue when too many ⦅ characters are present in a sentence. Somehow those sentences end up too big to pickle.
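
For anyone hitting the same thing, here is a rough sketch of what the traceback points at and of the kind of filter that worked around it. The last frame shows Python 3.6's multiprocessing packing the message length with `struct.pack("!i", n)`, a signed 32-bit integer, so anything sent to a worker that is larger than 2147483647 bytes fails with this error. The helper names (`fits_in_header`, `filter_pairs`) and the `max_markers=20` threshold are my own assumptions for illustration, not part of OpenNMT-py; tune the threshold to your data.

```python
import struct

# Largest value a signed 32-bit int can hold (the limit quoted in the error).
INT32_MAX = 2**31 - 1


def fits_in_header(n):
    """Return True if n can be packed with the "!i" format from the traceback."""
    try:
        struct.pack("!i", n)
        return True
    except struct.error:
        return False


def filter_pairs(src_lines, tgt_lines, marker="⦅", max_markers=20):
    """Keep only sentence pairs where neither side has more than
    max_markers occurrences of marker.

    Hypothetical helper: the threshold is arbitrary, not an OpenNMT-py setting.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if src.count(marker) <= max_markers and tgt.count(marker) <= max_markers:
            kept.append((src, tgt))
    return kept


if __name__ == "__main__":
    print(fits_in_header(INT32_MAX))      # still fits
    print(fits_in_header(INT32_MAX + 1))  # same struct.error as in the log

    src = ["a ⦅x⦆ b", "⦅" * 30]
    tgt = ["c", "d"]
    print(filter_pairs(src, tgt))
```

Running the filter over the tokenized training files before calling onmt_preprocess drops the offending pairs. Both sides are checked together so the source and target files stay aligned.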