Need Help regarding Data Preparation

Hello everyone,

I am really glad to have found this community.

I am an Arabic<>English translator, and I have many data sets that I have built throughout my years of work. I am trying to build a system based on these data sets, but I am facing some technical issues, and I would like you to help and guide me through the steps to perform the process successfully.

I am using:
1- Windows 10
2- Miniconda - Python 3.6
3- PyTorch
4- Git terminal for Windows

With the sample data provided with the project, I followed the steps and everything worked through to the end: I got an English<>German system.

The Problem:

To use my own data set, I prepared the following files:

1- src-train.txt
2- tgt-train.txt
3- src-val.txt
4- tgt-val.txt
5- src-test.txt

(For each pair, the number of lines must be the same; a quick check is sketched below.)
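
Before preprocessing, it is worth verifying the counts programmatically. Below is a minimal Python sketch, assuming the files live under data/ as in the command further down. It counts physical newlines and also splitlines() boundaries; a mismatch between the two numbers points to hidden Unicode line separators:

# Compare two line counts for each parallel pair of files.
pairs = [("data/src-train.txt", "data/tgt-train.txt"),
         ("data/src-val.txt", "data/tgt-val.txt")]

for src_path, tgt_path in pairs:
    for path in (src_path, tgt_path):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        # Lines as counted by physical '\n' newlines (what wc -l reports).
        physical = text.count("\n")
        # Lines after splitting on every Unicode line boundary,
        # including U+2028, U+2029, and U+0085.
        logical = len(text.splitlines())
        print(path, "newlines:", physical, "splitlines:", logical)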

Then I replaced the sample files with mine and tried to preprocess the data with this command:

python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

I get this error:

[INFO] Extracting features...
[INFO]  * number of source features: 0.
[INFO]  * number of target features: 0.
[INFO] Building `Fields` object...
[INFO] Building & saving training data...
[INFO]  * saving train data shard to data/demo.train.1.pt.
[INFO] Building & saving vocabulary...
[INFO]  * reloading data/demo.train.1.pt.
[INFO]  * tgt vocab size: 2254.
[INFO]  * src vocab size: 4251.
[INFO] Building & saving validation data...
Traceback (most recent call last):
  File "preprocess.py", line 211, in <module>
    main()
  File "preprocess.py", line 207, in main
    build_save_dataset('valid', fields, opt)
  File "preprocess.py", line 138, in build_save_dataset
    corpus_type, opt)
  File "preprocess.py", line 107, in build_save_in_shards
    dynamic_dict=opt.dynamic_dict)
  File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 79, in __init__
    for ex_values in example_values:
  File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 71, in <genexpr>
    example_values = ([ex[k] for k in keys] for ex in examples_iter)
  File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 57, in <genexpr>
    examples_iter = (self._join_dicts(src, tgt) for src, tgt in
  File "C:\Users\user\OpenNMT-py\onmt\inputters\text_dataset.py", line 356, in __iter__
    "Two corpuses must have same number of lines!")
AssertionError: Two corpuses must have same number of lines!

Your help with this is appreciated.
Thanks.

Hello @Remo2010, it is hard to analyze this without a minimal corpus on which we can reproduce it. Since the error happens on the validation dataset, can you check by running the same command with your validation set used as both the training and the validation data, as in the command below? To narrow it down further, you can also check whether the issue comes from the source or the target side. And when you have narrowed it down enough, can you open an issue on GitHub providing the file, so that we can reproduce it?
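
For example, reusing the command above with the validation files substituted in for training as well (data/debug is just a placeholder output prefix):

python preprocess.py -train_src data/src-val.txt -train_tgt data/tgt-val.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/debug

If this run also fails, the problem is in the validation files themselves.
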
My wild guess is that your corpus contains the Unicode newline characters U+2028, U+2029, or U+0085. These characters are often problematic, and depending on how the file is opened in Python, they may or may not be interpreted as newlines.
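
For instance, the following minimal sketch (assuming UTF-8 files under data/, with placeholder file names) reports any occurrence of these separators and shows one way to replace them with a plain space:

# Map the problematic code points to a space: LS (U+2028),
# PS (U+2029), and NEL (U+0085).
SEPARATORS = {0x2028: " ", 0x2029: " ", 0x0085: " "}

# Report where the separators occur.
with open("data/src-train.txt", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        for cp in SEPARATORS:
            if chr(cp) in line:
                print("line %d contains U+%04X" % (lineno, cp))

# Clean a file by applying the same mapping with str.translate,
# writing to a new file (the .clean.txt name is just an example).
with open("data/src-train.txt", encoding="utf-8") as f:
    cleaned = f.read().translate(SEPARATORS)
with open("data/src-train.clean.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)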


Thanks a lot @jean.senellart, you are right: I had missed some Unicode characters in my files, and they were responsible for this error. To confirm, I made 5-line versions of all the files and ran the command again; it worked normally and I had no issues.

I then applied the same fix to my full data set, and it works now.

I am really grateful for your help.

Hello @Remo2010, I am also working on Arabic<>English translation. It is good to see that the OpenNMT code is working for you. Would it be possible for you to share your training data? Looking forward to hearing from you soon.

Thanks,
Mohsin