WMT14 Translation Script

Hello Everyone,

I was following this tutorial here: Translation — OpenNMT-py documentation

I set up Ubuntu on WSL on my Windows machine with:

python3 == 3.8.5
pip3 == 20.0.2
perl == 5.30.0 (the shell script uses it, which is why I mention it; it feels odd to use perl in 2021)
opennmt-py == 2.0.1

  1. wget this shell script: OpenNMT-py/prepare_wmt_data.sh at master · OpenNMT/OpenNMT-py · GitHub

  2. Run these commands:

chmod u+x prepare_wmt_data.sh
./prepare_wmt_data.sh data/wmt

(By the way, the tutorial does not say that this script needs a path argument.)

After a while the dataset is downloaded. However, the yaml config from the tutorial contains these two definitions:

src_subword_model: data/wmt/wmtende.model
tgt_subword_model: data/wmt/wmtende.model

Running this command:
onmt_build_vocab -config wmt14_en_de.yaml -n_sample -1

prints this error message:
OSError: Not found: "data/wmt/wmtende.model": No such file or directory Error #2

These files do not exist in my folder. I searched the shell script for “.model”, but all of the matching lines (115 to 146) are commented out. Line 100 also has an “if false”, so the script is probably not applying sentencepiece either. Or is the model possibly included via OpenNMT?
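A quick way to catch this before running onmt_build_vocab is to check that the model files named in the config actually exist. The sketch below is only an illustration: the function name missing_subword_models is made up, and it reads the two keys quoted above with a simple regex instead of a real YAML parser.

```python
import re
from pathlib import Path

# Keys taken from the yaml excerpt quoted in this post.
SUBWORD_KEYS = ("src_subword_model", "tgt_subword_model")


def missing_subword_models(config_text, base_dir="."):
    """Return the subword model paths named in the config that are not on disk.

    Hypothetical helper: it only understands plain "key: value" lines,
    which is an assumption about how the config is written.
    """
    missing = []
    for line in config_text.splitlines():
        match = re.match(r"\s*(\w+)\s*:\s*(\S+)", line)
        if match and match.group(1) in SUBWORD_KEYS:
            path = Path(base_dir) / match.group(2).strip("\"'")
            # Source and target may share one model, so report it only once.
            if not path.is_file() and str(path) not in missing:
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    with open("wmt14_en_de.yaml", encoding="utf-8") as handle:
        for path in missing_subword_models(handle.read()):
            print(f"missing subword model: {path}")
```

If this prints anything, the sentencepiece step of the preparation script has not produced its output yet.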

Can you check the tutorial and the shell script? I think some steps are missing.

My other, more personal question: is there a specific reason to use a shell script for this preprocessing? As I understand it, the shell script:

  1. Downloads the source and target files (wget) into a folder
  2. Deletes the unnecessary language pairs, keeping only de-en
  3. Concatenates them? (there is an “if false” on line 100)
  4. Applies sentencepiece (which could be done in Python; this section is commented out too)
  5. Parses the sgm files with a perl script? (the perl file only reads lines and applies a regex check, which is doable in Python too)

If I understand the steps correctly, I can also help you write a Python script that downloads these files, applies the regex to the sgm files, and is OS independent.
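To make the sgm step concrete, here is a minimal sketch of how the segment text could be pulled out with a regex in Python. It is not a port of the perl script: the pattern, the whitespace handling, and the function names extract_segments and sgm_to_plain are my own assumptions about what is needed.

```python
import re

# Matches one <seg ...>text</seg> element; DOTALL lets a segment span lines.
SEG_PATTERN = re.compile(r"<seg[^>]*>\s*(.*?)\s*</seg>", re.DOTALL | re.IGNORECASE)


def extract_segments(sgm_text):
    """Return the text of every <seg> element, one string per segment."""
    # Collapse internal whitespace so each segment ends up on a single line.
    return [" ".join(text.split()) for text in SEG_PATTERN.findall(sgm_text)]


def sgm_to_plain(sgm_path, out_path, encoding="utf-8"):
    """Write one segment per line to out_path and return the segment count."""
    with open(sgm_path, encoding=encoding) as src:
        segments = extract_segments(src.read())
    with open(out_path, "w", encoding=encoding) as dst:
        for segment in segments:
            dst.write(segment + "\n")
    return len(segments)
```

This sketch leaves character entities such as &amp; untouched, so the output would need to be compared with what the perl script produces before relying on it.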

Thank you.

The file you’re referring to is the sentencepiece model.
It’s supposed to be created here:

Not sure why we set this to false by default though.

My other, more personal question: is there a specific reason to use a shell script for this preprocessing? As I understand it, the shell script:

You can do these steps however you like. It’s just an example to enable people to get started fairly easily.

If I enable that “false” block, the script should work then, right?

Thank you for the information. I’ll give it a try and share the results.

That worked. I hope you update the script for people like me :slight_smile:

I know sentencepiece is not a library of yours, but is it normal for it to use 29GB of RAM and a single CPU core when building a model from 9.1M sentences?

My logs:

trainer_interface.cc(267) LOG(INFO) Loading corpus: data/train.txt
trainer_interface.cc(287) LOG(WARNING) Found too long line (25081 > 4192).
trainer_interface.cc(289) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(290) LOG(WARNING) The maximum length can be changed with --max_sentence_length= flag.
trainer_interface.cc(139) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 3000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 4000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 5000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 6000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 7000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 8000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 9000000 lines
trainer_interface.cc(114) LOG(WARNING) Too many sentences are loaded! (9110889), which may slow down training.
trainer_interface.cc(116) LOG(WARNING) Consider using --input_sentence_size= and --shuffle_input_sentence=true.
trainer_interface.cc(119) LOG(WARNING) They allow to randomly sample sentences from the entire corpus.
trainer_interface.cc(315) LOG(INFO) Loaded all 9110889 sentences
trainer_interface.cc(321) LOG(INFO) Skipped 98 too long sentences.
trainer_interface.cc(330) LOG(INFO) Adding meta_piece:
trainer_interface.cc(330) LOG(INFO) Adding meta_piece:
trainer_interface.cc(330) LOG(INFO) Adding meta_piece:

trainer_interface.cc(335) LOG(INFO) Normalizing sentences…
trainer_interface.cc(384) LOG(INFO) all chars count=1331305324
trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=3502
trainer_interface.cc(403) LOG(INFO) Final character coverage=1
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 9110889 sentences.
unigram_model_trainer.cc(129) LOG(INFO) Making suffix array…


Yes, it’s perfectly normal with so many sentences. Also note that at some point more cores are used (I think after making the suffix array).

Thank you panosk,

Hopefully it will use multiple cores in the next phases :slight_smile:

It’s been 14 hours and it is still in the “making suffix array” phase. I will open a request in the library’s repo; maybe they can improve something with CPU threads, or use CUDA, etc.

This, however, is not normal. Do you actually have that much RAM, or is swap being used?

I’m running on my laptop: 6GB is actual RAM and 23GB is swap.

I started with 8GB and then 16GB of swap, and both times the process was “killed” with errors.

Now it’s running. Here is my actual screenshot

For information, the run finished today. It took more than 2 days for sure. Since the script does not write timestamps, I cannot give the exact time, but I can say at least 50 hours.
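For anyone who wants exact timings on a long run like this, a small wrapper can print timestamps around the command. This is only a sketch; the name run_timed is made up for illustration.

```python
import subprocess
import sys
import time
from datetime import datetime, timedelta


def run_timed(cmd):
    """Run cmd, print start and end timestamps, return (returncode, seconds)."""
    started = datetime.now()
    print(f"[{started:%Y-%m-%d %H:%M:%S}] starting: {' '.join(cmd)}", file=sys.stderr)
    t0 = time.monotonic()
    result = subprocess.run(cmd)
    elapsed = time.monotonic() - t0
    finished = datetime.now()
    print(
        f"[{finished:%Y-%m-%d %H:%M:%S}] finished after "
        f"{timedelta(seconds=round(elapsed))}",
        file=sys.stderr,
    )
    return result.returncode, elapsed


if __name__ == "__main__":
    # Same invocation as earlier in this thread.
    run_timed(["./prepare_wmt_data.sh", "data/wmt"])
```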

The model and vocab files are just 798KB and 589KB.

You should add this information to the tutorial so people do not expect a quick run from the shell script. Maybe just provide these files and add a note like:

“You can run this script and wait 40-50 hours to build the model and vocab files from 9M sentences, or just download these two files and continue with training.”

Thank you all for your help so far. I will continue with the training phase and see what error messages I encounter next :grin:

This is not the proper way to work with the script. Most example NMT training pipelines assume that the user has a strong machine that can handle the workload. Otherwise, the user should adapt the scripts, if possible, to their use case and not run them blindly without checking them against their machine specs.
In this case, you should use the --input_sentence_size option in the sentencepiece section of the script to decrease the number of sentences fed to sentencepiece so they can fit in your available RAM:

spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/wmt$sl$tl \
           --vocab_size=$vocab_size --character_coverage=1 --input_sentence_size=1000000

Try to experiment with a value that lets the sentences fit in your RAM without swapping; that way, training should take only a few minutes. With this lower value the sentencepiece model will probably not cover all subwords in the corpus, but at least the process should finish in a reasonable amount of time.
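As a starting point for that experiment, the figures reported in this thread (about 29GB for 9,110,889 sentences) can be scaled down to the memory actually available. The sketch below assumes that memory use grows roughly linearly with the number of sentences, which nobody here has verified, and the 0.8 safety factor and the function names are arbitrary choices of mine. Treat the result as a first guess, not a guarantee.

```python
# Figures reported in this thread: about 29GB for 9,110,889 sentences.
OBSERVED_BYTES = 29 * 1024**3
OBSERVED_SENTENCES = 9_110_889


def parse_mem_available_kb(meminfo_text):
    """Return MemAvailable (in kB) from the text of /proc/meminfo, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None


def suggest_input_sentence_size(available_bytes, safety=0.8):
    """Rough guess at a sentence count that fits in RAM.

    Assumes linear scaling from the numbers above; safety is an arbitrary
    margin so the process does not start swapping.
    """
    usable = int(available_bytes * safety)
    return min(OBSERVED_SENTENCES, usable * OBSERVED_SENTENCES // OBSERVED_BYTES)


if __name__ == "__main__":
    # /proc/meminfo exists on Linux, including WSL.
    with open("/proc/meminfo", encoding="ascii") as handle:
        available_kb = parse_mem_available_kb(handle.read())
    if available_kb is not None:
        print(suggest_input_sentence_size(available_kb * 1024))
```

The printed number would go into --input_sentence_size, and can be lowered further if the process still starts swapping.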


Yes, I learned it the hard way.

Nothing about server specs is written on the Tutorial page, the Quickstart page, or the GitHub page.

Also, there is a “false” block that has to be enabled for the shell script to work, as françois said in the reply above.

You should mention these somewhere on the website and modify the shell script, so people have at least some idea of what real-world examples require. :grin:

No doc is perfect. Feel free to PR some changes.

Yes, sure. I will add some notes, including in the quick start:

  1. Don’t expect good results from 30K sentences; the translation output will look like “Das die, Hotel” etc., and that is normal.
  2. Sentencepiece on 9M sentences takes about 29GB of RAM; you can create swap or change the spm_train call to the command below:
    spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/wmt$sl$tl \
               --vocab_size=$vocab_size --character_coverage=1 --input_sentence_size=1000000
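For the Python route mentioned earlier in the thread, the same spm_train call could be assembled as an argument list and run with subprocess. This is a sketch only: build_spm_train_cmd is a made-up name, the flags are the ones quoted in this thread plus --shuffle_input_sentence=true from the sentencepiece warning in the logs, and the values in the usage comment (en, de, 32000) are placeholders rather than the script's actual settings.

```python
import subprocess


def build_spm_train_cmd(data_path, sl, tl, vocab_size, input_sentence_size=1_000_000):
    """Build the spm_train call quoted in this thread, with sampling enabled."""
    return [
        "spm_train",
        f"--input={data_path}/train.txt",
        f"--model_prefix={data_path}/wmt{sl}{tl}",
        f"--vocab_size={vocab_size}",
        "--character_coverage=1",
        # Feed only a random sample of the corpus to the trainer.
        f"--input_sentence_size={input_sentence_size}",
        "--shuffle_input_sentence=true",
    ]


# Example call (placeholder values, requires spm_train on the PATH):
# subprocess.run(build_spm_train_cmd("data/wmt", "en", "de", 32000), check=True)
```

Passing the arguments as a list avoids shell quoting issues and works the same way on any OS where spm_train is installed.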