OpenNMT Forum

AssertionError in Preprocessing Script

I am trying to preprocess the WMT-14 English-French dataset. I previously applied BPE to this dataset.

Command I used:

python3 -train_src data/BPE/train.src -train_tgt data/BPE/train.tgt -valid_src data/BPE/val.src -valid_tgt data/BPE/val.tgt -save_data data/BPE/en_fr -src_vocab_size 100000 -tgt_vocab_size 100000 -src_seq_length 128 -tgt_seq_length 128

[2020-07-11 13:38:09,867 INFO] * saving 33th train data shard to data/BPE/
[2020-07-11 13:38:55,736 INFO] Building shard 34.
[2020-07-11 13:39:40,951 INFO] * saving 34th train data shard to data/BPE/
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/translateme/Documents/Exp3/onmt/bin/", line 54, in process_one_shard
    assert len(src_shard) == len(tgt_shard)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "", line 6, in
  File "/home/translateme/Documents/Exp3/onmt/bin/", line 318, in main
  File "/home/translateme/Documents/Exp3/onmt/bin/", line 298, in preprocess
    'train', fields, src_reader, tgt_reader, align_reader, opt)
  File "/home/translateme/Documents/Exp3/onmt/bin/", line 205, in build_save_dataset
    for sub_counter in p.imap(func, shard_iter):
  File "/usr/lib/python3.6/multiprocessing/", line 735, in next
    raise value

Any suggestions on how I should resolve this? My dataset has approximately 34M parallel sentences.

Many thanks in advance.

There probably is some kind of alignment mismatch between your source and target. Do both sides have the exact same number of lines?
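
One way to check this is a small standalone script along the following lines. It is only a sketch and not part of OpenNMT: the function names are made up for illustration. It counts lines in two ways, once by raw "\n" bytes (what `wc -l` sees) and once the way Python's text mode splits lines, where a lone "\r" also ends a line. Whether the preprocessing script reads files in a way that is affected by stray "\r" characters is an assumption here, but a difference between the two counts is a useful hint that such characters are present.

```python
def count_lines_raw(path):
    """Count lines by raw b"\\n" bytes, reading the file in 1 MiB chunks."""
    count = 0
    last = b""
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            count += chunk.count(b"\n")
            last = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last and last != b"\n":
        count += 1
    return count


def count_lines_text(path):
    """Count lines as Python's text mode sees them.

    With universal newlines, "\\n", "\\r\\n" and a lone "\\r" all end a line.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return line counts for both sides and whether they match."""
    report = {
        "src_raw": count_lines_raw(src_path),
        "tgt_raw": count_lines_raw(tgt_path),
        "src_text": count_lines_text(src_path),
        "tgt_text": count_lines_text(tgt_path),
    }
    report["raw_match"] = report["src_raw"] == report["tgt_raw"]
    report["text_match"] = report["src_text"] == report["tgt_text"]
    return report
```

For example, `check_parallel("data/BPE/train.src", "data/BPE/train.tgt")` returns a dictionary of the four counts. If `raw_match` is true but `text_match` is false, one side likely contains stray carriage returns inside sentences; if `raw_match` is false, the two files simply do not have the same number of lines and the corpus needs to be re-aligned before preprocessing.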