OpenNMT Forum

Missing 'data/demo.vocab.pt'

Hello! I am trying to train on my own dataset, but after preprocessing I can't seem to get a 'data/demo.vocab.pt' file. I am also getting an assertion error. What am I doing wrong?

    [2019-06-04 18:56:09,553 INFO] Extracting features...
    [2019-06-04 18:56:09,597 INFO]  * number of source features: 0.
    [2019-06-04 18:56:09,597 INFO]  * number of target features: 0.
    [2019-06-04 18:56:09,597 INFO] Building `Fields` object...
    [2019-06-04 18:56:09,598 INFO] Building & saving training data...
    [2019-06-04 18:56:09,598 INFO] Reading source and target files: data/src-train.txt data/tgt-train.txt.
    [2019-06-04 18:56:11,381 INFO] Building shard 0.
    [2019-06-04 18:56:37,781 INFO]  * saving 0th train data shard to data/demo.train.0.pt.
    [2019-06-04 18:57:06,543 INFO] Building shard 1.
    [2019-06-04 18:57:34,903 INFO]  * saving 1th train data shard to data/demo.train.1.pt.
    [2019-06-04 18:58:04,455 INFO] Building shard 2.
    [2019-06-04 18:58:32,880 INFO]  * saving 2th train data shard to data/demo.train.2.pt.
    [2019-06-04 18:59:02,682 INFO] Building shard 3.
    [2019-06-04 18:59:32,552 INFO]  * saving 3th train data shard to data/demo.train.3.pt.
    [2019-06-04 19:00:02,571 INFO] Building shard 4.
    [2019-06-04 19:00:32,264 INFO]  * saving 4th train data shard to data/demo.train.4.pt.
    [2019-06-04 19:01:01,782 INFO] Building shard 5.
    [2019-06-04 19:01:31,756 INFO]  * saving 5th train data shard to data/demo.train.5.pt.
    [2019-06-04 19:02:01,825 INFO] Building shard 6.
    [2019-06-04 19:02:33,348 INFO]  * saving 6th train data shard to data/demo.train.6.pt.
    [2019-06-04 19:03:02,821 INFO] Building shard 7.
    [2019-06-04 19:03:38,801 INFO]  * saving 7th train data shard to data/demo.train.7.pt.
    [2019-06-04 19:04:08,545 INFO] Building shard 8.
    [2019-06-04 19:04:39,184 INFO]  * saving 8th train data shard to data/demo.train.8.pt.
    [2019-06-04 19:05:08,535 INFO] Building shard 9.
    [2019-06-04 19:05:40,973 INFO]  * saving 9th train data shard to data/demo.train.9.pt.
    Traceback (most recent call last):
      File "preprocess.py", line 217, in <module>
        main(opt)
      File "preprocess.py", line 198, in main
        'train', fields, src_reader, tgt_reader, opt)
      File "preprocess.py", line 83, in build_save_dataset
        assert len(src_shard) == len(tgt_shard)
    AssertionError

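Check the number of lines on each side of the dataset. The assertion aborts preprocessing partway through, which is probably why 'data/demo.vocab.pt' was never written.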

I think your source and target files contain different numbers of sentences.
You can check the line count of each file in the data directory with:

    wc -l *
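If you would rather do the same check from Python, here is a minimal sketch. The helper names `count_lines` and `check_parallel` are made up for illustration and are not part of OpenNMT; the file paths are the ones from the log above. Note that, unlike `wc -l`, this also counts a final line that has no trailing newline.

```python
from pathlib import Path


def count_lines(path):
    """Count lines by iterating over the file (hypothetical helper, not an OpenNMT function)."""
    # Binary mode avoids decoding errors; a last line without "\n" is still counted.
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_parallel(src_path, tgt_path):
    """Return (source line count, target line count, whether they match)."""
    n_src = count_lines(src_path)
    n_tgt = count_lines(tgt_path)
    return n_src, n_tgt, n_src == n_tgt


if __name__ == "__main__":
    # Paths taken from the preprocessing log; adjust them to your own data.
    src = Path("data/src-train.txt")
    tgt = Path("data/tgt-train.txt")
    if src.exists() and tgt.exists():
        n_src, n_tgt, ok = check_parallel(src, tgt)
        print(f"source: {n_src} lines, target: {n_tgt} lines")
        if not ok:
            print("Line counts differ; the files are not aligned sentence by sentence.")
```

If the two counts differ, the parallel files need to be fixed before rerunning preprocess.py.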

Thanks! I think my data preparation was incorrect.

Thank you!