Hello everyone,
I am really glad to have found this community.
I am an Arabic<>English translator. Over my years of work I have built up many data sets, and I am trying to build a translation system based on them. However, I am facing some technical issues, and I would appreciate your help and guidance on the steps needed to complete the process successfully.
I am using:
1- Windows 10
2- Miniconda with Python 3.6
3- PyTorch
4- Git terminal for Windows
I followed the steps for the sample data that comes with the project, and everything worked through to the end: I got an English<>German system.
The Problem:
When I tried to use my own data set, I prepared the following files:
1- src-train.txt
2- tgt-train.txt
3- src-val.txt
4- tgt-val.txt
5- src-test.txt
(For each pair, the number of lines is the same.)
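The line counts for each pair can be double-checked with a small script. What follows is only a minimal sketch: `count_lines` and `compare_pair` are hypothetical helper names, not part of OpenNMT-py, and the script counts lines by "\n" bytes, which may not match exactly how preprocess.py reads the files.

```python
from pathlib import Path


def count_lines(path):
    """Count lines by newline bytes, adding one for an unterminated last line."""
    data = Path(path).read_bytes()
    count = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        count += 1
    return count


def compare_pair(src_path, tgt_path):
    """Return (src_count, tgt_count, counts_match) for one source/target pair."""
    src_count = count_lines(src_path)
    tgt_count = count_lines(tgt_path)
    return src_count, tgt_count, src_count == tgt_count
```

For example, `compare_pair("data/src-val.txt", "data/tgt-val.txt")` would return both counts and whether they match, and the same call can be repeated for the training pair.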
Then I replaced the sample files with my own and tried to preprocess the data with this command:
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
I get this error:
[INFO] Extracting features...
[INFO] * number of source features: 0.
[494 INFO] * number of target features: 0.
[494 INFO] Building `Fields` object...
[INFO] Building & saving training data...
[INFO] * saving train data shard to data/demo.train.1.pt.
[INFO] Building & saving vocabulary...
[INFO] * reloading data/demo.train.1.pt.
[INFO] * tgt vocab size: 2254.
[INFO] * src vocab size: 4251.
[INFO] Building & saving validation data...
Traceback (most recent call last):
File "preprocess.py", line 211, in <module>
main()
File "preprocess.py", line 207, in main
build_save_dataset('valid', fields, opt)
File "preprocess.py", line 138, in build_save_dataset
corpus_type, opt)
File "preprocess.py", line 107, in build_save_in_shards
dynamic_dict=opt.dynamic_dict)
File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 79, in __init__
for ex_values in example_values:
File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 71, in <genexpr>
example_values = ([ex[k] for k in keys] for ex in examples_iter)
File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 57, in <genexpr>
examples_iter = (self._join_dicts(src, tgt) for src, tgt in
File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 356, in __iter__
"Two corpuses must have same number of lines!")
AssertionError: Two corpuses must have same number of lines!
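The traceback shows the failure while building the 'valid' data, so the pair to inspect is src-val.txt and tgt-val.txt. One possible cause worth ruling out (an assumption; nothing in the log confirms it) is that one of these files contains characters other than "\n" that some Python readers treat as line breaks, so the files look equal in an editor but are split differently when read. The sketch below lists such characters; `find_extra_breaks` and `EXTRA_BREAKS` are hypothetical names, and the character set is the one documented for Python's `str.splitlines()`.

```python
from pathlib import Path

# Characters other than "\n" that str.splitlines() treats as line boundaries.
EXTRA_BREAKS = "\r\x0b\x0c\x1c\x1d\x1e\x85\u2028\u2029"


def find_extra_breaks(path, encoding="utf-8"):
    """Return (line_number, repr_of_char) for every extra break character."""
    # Decode from bytes so Python does not translate any line endings.
    text = Path(path).read_bytes().decode(encoding)
    hits = []
    for number, line in enumerate(text.split("\n"), start=1):
        # A trailing "\r" is just a Windows line ending, so ignore it.
        body = line[:-1] if line.endswith("\r") else line
        for char in body:
            if char in EXTRA_BREAKS:
                hits.append((number, repr(char)))
    return hits
```

An empty result for both validation files would suggest this is not the cause, and the mismatch lies elsewhere, for example in a genuinely missing or extra line.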
Your help with this issue is appreciated.
Thanks.