I have created an Ubuntu WSL environment on my Windows setup with:
python3 == 3.8.5
pip3 == 20.0.2
perl == 5.30.0 (mentioned because the training shell script uses it; relying on Perl in 2021 feels odd)
opennmt-py == 2.0.1
Running this command:
onmt_build_vocab -config wmt14_en_de.yaml -n_sample -1
prints this error message:
OSError: Not found: "data/wmt/wmtende.model": No such file or directory Error #2
But these files do not exist in my folder. I searched the shell script for .model, but all of those lines (115 to 146) are commented out, and line 100 has an if false as well, so apparently this script is not applying sentencepiece; is that step perhaps supposed to be handled by OpenNMT itself?
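For reference, the path in the error seems to come from the yaml config rather than from the shell script; as far as I can tell, wmt14_en_de.yaml declares a sentencepiece transform with entries along these lines (the exact keys in the example config may differ):

transforms: [sentencepiece, filtertoolong]
src_subword_model: data/wmt/wmtende.model
tgt_subword_model: data/wmt/wmtende.model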
Can you check the tutorial and shell script? I guess there are some missing points.
My other, more personal question: is there a specific reason to use a shell script for this preprocessing step? As I understand it, the shell script is:
downloading the source and target files (wget) into a folder
deleting the unnecessary language pairs, keeping only de-en
concatenating them? (there is an if false on line 100)
applying sentencepiece (which could also be done via Python; this section is commented out too)
parsing the sgm files with a perl script? (that perl file only reads lines with a regex check, which is also doable in Python)
If I understand the steps correctly, I can also help you write a Python script that downloads these files, applies the regex to the sgm files via Python, and is OS independent; a rough sketch of the sgm-parsing part is below.
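Something like this minimal sketch, assuming one <seg> per line as in the WMT sgm files; the file names are placeholders, and the regex just mirrors what the perl helper does:

import re

# Matches the text inside <seg ...>...</seg>, mirroring the regex in the perl helper.
SEG_RE = re.compile(r"<seg[^>]*>\s*(.*?)\s*</seg>", re.IGNORECASE)

def sgm_to_plain_text(sgm_path, out_path):
    """Extract one sentence per line from an sgm file."""
    with open(sgm_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            match = SEG_RE.search(line)
            if match:
                fout.write(match.group(1) + "\n")

if __name__ == "__main__":
    # Placeholder file names, not the exact ones the shell script downloads.
    sgm_to_plain_text("newstest2014-deen-src.de.sgm", "newstest2014.de")
    sgm_to_plain_text("newstest2014-deen-ref.en.sgm", "newstest2014.en")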
It has been 14 hours and it is still in the "making suffix array" phase. I will create a request in the library's repo; maybe they can improve something with CPU threads, use CUDA, etc.
For information, the run finished today. It took more than two days for sure; since it does not write timestamps I cannot give the exact time, but it was at least 50 hours.
The model and vocab files are just 798 KB and 589 KB.
You should add this information to the tutorial so people do not expect a quick run from the shell script; maybe just add these two files and a note like:
"You can run this script and wait 40-50 hours to build the model and vocab files from 9M sentences, or just download these two files and continue with training."
Thank you all for your help for now. I will continue with the training phase and see what error messages I encounter next.
This is not the proper way to work with the script. Most example NMT training pipelines assume that the user has a strong machine that can handle the workload. Otherwise, the user should adapt the scripts, if possible, to their use case instead of running them blindly without checking them against their machine specs.
In this case, you should use the --input_sentence_size option in the sentencepiece section of the script to decrease the number of sentences fed to sentencepiece so that they fit in your available RAM.
Try to experiment with a value that fits the sentences in your RAM without swapping; that way, training should take only a few minutes. With this lower value the sentencepiece model will probably not cover all subwords in the corpus, but at least the process should finish in a reasonable amount of time.
Yes sure, I will add some notes both in quick start
Don't expect good results from 30K sentences; the translation file will look like "Das die, Hotel" etc., and that is normal.
SentencePiece on 9M sentences takes about 29 GB of RAM; you may create swap space or change the spm_train call to the command below:
spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/wmt$sl$tl \
    --vocab_size=$vocab_size --character_coverage=1 --input_sentence_size=1000000
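If you prefer to stay in Python, a rough equivalent of that call via the SentencePiece Python API could look like the sketch below; the input path, model prefix, and vocab size are placeholders to be matched to the script's $DATA_PATH, $sl/$tl and $vocab_size, and 1000000 is just an example cap to tune against your RAM:

import sentencepiece as spm

# Rough Python-API equivalent of the spm_train command above.
# Paths, prefix and vocab_size are placeholders, not the script's actual values.
spm.SentencePieceTrainer.train(
    input="data/wmt/train.txt",
    model_prefix="data/wmt/wmtende",
    vocab_size=32000,
    character_coverage=1.0,
    input_sentence_size=1000000,   # cap the number of sentences sampled for training
    shuffle_input_sentence=True,   # sample randomly instead of taking the head of the file
)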