AssertionError in Preprocessing Script

Rishi · July 11, 2020, 5:01pm

I am trying to preprocess the WMT-14 English-French dataset. I previously applied BPE to this dataset.

Command I used:

python3 preprocess.py -train_src data/BPE/train.src -train_tgt data/BPE/train.tgt -valid_src data/BPE/val.src -valid_tgt data/BPE/val.tgt -save_data data/BPE/en_fr -src_vocab_size 100000 -tgt_vocab_size 100000 -src_seq_length 128 -tgt_seq_length 128

[2020-07-11 13:38:09,867 INFO] * saving 33th train data shard to data/BPE/en_fr.train.33.pt.
[2020-07-11 13:38:55,736 INFO] Building shard 34.
[2020-07-11 13:39:40,951 INFO] * saving 34th train data shard to data/BPE/en_fr.train.34.pt.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/translateme/Documents/Exp3/onmt/bin/preprocess.py", line 54, in process_one_shard
    assert len(src_shard) == len(tgt_shard)
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 6, in <module>
    main()
  File "/home/translateme/Documents/Exp3/onmt/bin/preprocess.py", line 318, in main
    preprocess(opt)
  File "/home/translateme/Documents/Exp3/onmt/bin/preprocess.py", line 298, in preprocess
    'train', fields, src_reader, tgt_reader, align_reader, opt)
  File "/home/translateme/Documents/Exp3/onmt/bin/preprocess.py", line 205, in build_save_dataset
    for sub_counter in p.imap(func, shard_iter):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
AssertionError

Any suggestions on how I should resolve this? My dataset has approximately 34M parallel sentences.

Many thanks in advance.

francoishernandez · July 15, 2020, 9:16am

There probably is some kind of alignment mismatch between your source and target. Do both sides have the exact same number of lines?
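
One quick way to check is to count the lines on each side. The snippet below is only a minimal sketch and not part of OpenNMT-py: `count_lines` and `compare_line_counts` are hypothetical helper names, and it assumes lines are separated by `\n` only, which may differ from how other tools split the same file.

```python
def count_lines(path):
    """Count lines in a file, splitting on b"\\n" only.

    The file is read in binary mode so that no decoding or universal-newline
    handling changes where the line boundaries fall.
    """
    count = 0
    with open(path, "rb") as f:
        for _ in f:
            count += 1
    return count


def compare_line_counts(src_path, tgt_path):
    """Return (src_lines, tgt_lines, counts_match) for a parallel corpus."""
    src_lines = count_lines(src_path)
    tgt_lines = count_lines(tgt_path)
    return src_lines, tgt_lines, src_lines == tgt_lines
```

For example, `compare_line_counts("data/BPE/train.src", "data/BPE/train.tgt")` with the paths from the command above returns both counts and whether they match. If they differ, the two files are not line-aligned, and the mismatch most likely needs to be fixed before running preprocess.py again.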