I am really glad to have found this community.
I am an Arabic<>English translator. Over my years of work I have built up many data sets, and I am trying to build a system based on them. However, I am running into some technical issues, and I would appreciate your help and guidance on the steps needed to complete the process successfully.
I am using:
1- Windows 10
2- Miniconda - Python 3.6
3- PyTorch
4- Git terminal for Windows
I followed the steps with the sample data provided with the project, and everything worked through to the end: I got an English<>German system.
When I tried to use my own data set, I prepared my own files:
(for each pair, the number of lines is the same)
Then I replaced the sample files and tried to preprocess the data with this command:
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo
I get this error:
[INFO] Extracting features...
[INFO] * number of source features: 0.
[494 INFO] * number of target features: 0.
[494 INFO] Building `Fields` object...
[INFO] Building & saving training data...
[INFO] * saving train data shard to data/demo.train.1.pt.
[INFO] Building & saving vocabulary...
[INFO] * reloading data/demo.train.1.pt.
[INFO] * tgt vocab size: 2254.
[INFO] * src vocab size: 4251.
[INFO] Building & saving validation data...
Traceback (most recent call last):
  File "preprocess.py", line 211, in <module>
    main()
  File "preprocess.py", line 207, in main
    build_save_dataset('valid', fields, opt)
  File "preprocess.py", line 138, in build_save_dataset
    corpus_type, opt)
  File "preprocess.py", line 107, in build_save_in_shards
    dynamic_dict=opt.dynamic_dict)
  File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 79, in __init__
    for ex_values in example_values:
  File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 71, in <genexpr>
    example_values = ([ex[k] for k in keys] for ex in examples_iter)
  File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 57, in <genexpr>
    examples_iter = (self._join_dicts(src, tgt) for src, tgt in
  File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 356, in __iter__
    "Two corpuses must have same number of lines!")
AssertionError: Two corpuses must have same number of lines!
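Since the assertion is raised while building the validation data, one way to double-check the line counts of each pair is a small script like the one below. This is only a minimal sketch and not part of OpenNMT-py: the helper name count_lines is made up for illustration, and the file paths are simply the ones from the command above. It counts lines in two ways, by "\n" only and by Python's Unicode-aware str.splitlines(), because characters such as U+2028 can make the two counts differ. Whether that is what happens here is an assumption that the traceback does not confirm.

```python
import os


def count_lines(path, encoding="utf-8"):
    """Return (newline_count, unicode_count) for a text file.

    newline_count: lines delimited by "\\n" only (what most editors show).
    unicode_count: lines as split by str.splitlines(), which also breaks on
                   characters such as U+2028, U+2029, U+0085, \\v and \\f.
    """
    with open(path, "rb") as f:
        raw = f.read()
    # A final line without a trailing newline still counts as a line.
    newline_count = raw.count(b"\n") + (1 if raw and not raw.endswith(b"\n") else 0)
    unicode_count = len(raw.decode(encoding).splitlines())
    return newline_count, unicode_count


if __name__ == "__main__":
    # Paths taken from the preprocess command; adjust if yours differ.
    pairs = [
        ("data/src-train.txt", "data/tgt-train.txt"),
        ("data/src-val.txt", "data/tgt-val.txt"),
    ]
    for src, tgt in pairs:
        if not (os.path.exists(src) and os.path.exists(tgt)):
            print("skipping missing pair:", src, tgt)
            continue
        print(src, count_lines(src), "|", tgt, count_lines(tgt))
```

If the two numbers differ for any file, or the Unicode-aware counts differ between the source and target of a pair, that file would be the first place to look.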
Your help with this issue is much appreciated.