WMT14 Translation Script

chopinml · April 1, 2021, 2:26pm

Hello Everyone,

I was following this tutorial here: Translation — OpenNMT-py documentation

I set up Ubuntu under WSL on my Windows machine with:

python3 == 3.8.5
pip3 == 20.0.2
perl == 5.30.0 (I mention it because the shell script uses it; using Perl in 2021 seems odd to me)
opennmt-py == 2.0.1

I downloaded this shell script with wget: OpenNMT-py/prepare_wmt_data.sh at master · OpenNMT/OpenNMT-py · GitHub
Then I ran these commands:

chmod u+x prepare_wmt_data.sh
./prepare_wmt_data.sh data/wmt (by the way, the tutorial does not say that this script needs a path argument)

After a while the dataset is downloaded, but the YAML config in this section of the tutorial
https://opennmt.net/OpenNMT-py/examples/Translation.html#step-1-build-the-vocabulary

contains two definitions:

src_subword_model: data/wmt/wmtende.model
tgt_subword_model: data/wmt/wmtende.model

Running this command:
onmt_build_vocab -config wmt14_en_de.yaml -n_sample -1

prints this error message:
OSError: Not found: "data/wmt/wmtende.model": No such file or directory Error #2

These files do not exist in my folder. I searched the shell script for ".model", but all of the matching lines are commented out (lines 115 to 146), and line 100 starts an "if false" block as well. So this script is probably not applying sentencepiece at all. Or is the model supposed to be provided by OpenNMT?

Can you check the tutorial and the shell script? I think some steps are missing.
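
A minimal sketch of a pre-flight check for this situation, assuming the config lists src_subword_model and tgt_subword_model on their own lines as in the tutorial. It uses a regex instead of a YAML parser to stay within the standard library, and the function name missing_subword_models is hypothetical, not part of OpenNMT-py:

```python
import os
import re

# Matches lines such as "src_subword_model: data/wmt/wmtende.model".
SUBWORD_KEY = re.compile(r"^\s*(src_subword_model|tgt_subword_model)\s*:\s*(\S+)")


def missing_subword_models(config_text, base_dir="."):
    """Return (key, path) pairs whose model file does not exist under base_dir."""
    missing = []
    for line in config_text.splitlines():
        match = SUBWORD_KEY.match(line)
        if not match:
            continue
        key = match.group(1)
        path = match.group(2).strip("\"'")  # tolerate quoted values
        if not os.path.isfile(os.path.join(base_dir, path)):
            missing.append((key, path))
    return missing


# Example (hypothetical usage before calling onmt_build_vocab):
# with open("wmt14_en_de.yaml") as f:
#     print(missing_subword_models(f.read()))
```

If the returned list is not empty, the sentencepiece model has not been trained yet and onmt_build_vocab will fail with the error shown above.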

Another question of my own: is there a specific reason to use a shell script for this preprocessing? As I understand it, the shell script does the following:

Downloads the source and target files (wget) into a folder
Deletes the unneeded language pairs, keeping only de-en
Concatenates them? (there is an "if false" on line 100)
Applies sentencepiece (which could be done in Python; this section is also commented out)
Parses the sgm files with a Perl script? (the Perl file only reads lines and checks them against a regex, which is doable in Python too)

If I understand the steps correctly, I can help you write a Python script that downloads these files and applies the regex to the sgm files, and that will be OS independent.
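
As an illustration of the sgm step, here is a minimal Python sketch, assuming each <seg> element sits on a single line. It is not a line-for-line port of the Perl script used by prepare_wmt_data.sh, and the function names are hypothetical:

```python
import re

# Matches one segment line such as: <seg id="12">Some text</seg>
SEG_PATTERN = re.compile(r"<seg[^>]*>(.*?)</seg>", re.IGNORECASE)


def extract_segments(sgm_text):
    """Return the text of every <seg> element, one string per segment."""
    segments = []
    for line in sgm_text.splitlines():
        match = SEG_PATTERN.search(line)
        if match:
            # Keep only the segment body, trimmed of surrounding whitespace.
            segments.append(match.group(1).strip())
    return segments


def sgm_to_plain(sgm_path, out_path):
    """Write the segments of an sgm file to out_path, one per line."""
    with open(sgm_path, encoding="utf-8") as src:
        segments = extract_segments(src.read())
    with open(out_path, "w", encoding="utf-8") as dst:
        for segment in segments:
            dst.write(segment + "\n")
    return len(segments)
```

Lines without a <seg> element (document headers, paragraph markers) are skipped, so the output has one sentence per line like the training files.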

Thank you.

francoishernandez · April 1, 2021, 3:04pm

The file you’re referring to is the sentencepiece model.
It’s supposed to be created here:

https://github.com/OpenNMT/OpenNMT-py/blob/94d96076b283a4b66c4c32e9db6ded13c511be56/examples/scripts/prepare_wmt_data.sh#L100-L113


if false; then
 echo "$0: Training sentencepiece model"
 rm -f $DATA_PATH/train.txt
 for ((i=1; i<= ${#corpus[@]}; i++))
 do
  for f in $DATA_PATH/${corpus[$i]}.$sl $DATA_PATH/${corpus[$i]}.$tl
   do
    cat $f >> $DATA_PATH/train.txt
   done
 done
 spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/wmt$sl$tl \
           --vocab_size=$vocab_size --character_coverage=1
 rm $DATA_PATH/train.txt
fi

Not sure why we set this to false by default though.
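
For readers who prefer to do this step outside the shell script, the concatenation loop in that block can be sketched in Python as follows. This is an assumption-laden sketch: build_spm_training_file is a hypothetical helper, the prefixes stand in for the script's corpus array, and spm_train still has to be run on the resulting file afterwards.

```python
import shutil


def build_spm_training_file(corpus_prefixes, src_lang, tgt_lang, out_path):
    """Concatenate <prefix>.<src_lang> and <prefix>.<tgt_lang> for every prefix."""
    with open(out_path, "wb") as out:
        for prefix in corpus_prefixes:
            for lang in (src_lang, tgt_lang):
                # Copy in binary mode so the text is appended byte for byte.
                with open(f"{prefix}.{lang}", "rb") as part:
                    shutil.copyfileobj(part, out)
    return out_path


# Example (hypothetical paths):
# build_spm_training_file(["data/wmt/corpus_a", "data/wmt/corpus_b"],
#                         "en", "de", "data/wmt/train.txt")
```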

Quoting chopinml: "My other personal question: Is there a specific reason to use a shell script for this preprocessing purpose?"

You can do these steps however you like. It’s just an example to enable people to get started fairly easily.

chopinml · April 1, 2021, 3:55pm

If I enable that "if false" block, the script should work then, right?

Thank you for the information. I'll give it a try and share the results.

chopinml · April 1, 2021, 10:25pm

That worked. I hope you update the script for people like me.

I know sentencepiece is not your library, but is it normal for it to use 29GB of RAM and a single CPU core when training a model on 9.1M sentences?

My logs:

trainer_interface.cc(267) LOG(INFO) Loading corpus: data/train.txt
trainer_interface.cc(287) LOG(WARNING) Found too long line (25081 > 4192).
trainer_interface.cc(289) LOG(WARNING) Too long lines are skipped in the training.
trainer_interface.cc(290) LOG(WARNING) The maximum length can be changed with --max_sentence_length= flag.
trainer_interface.cc(139) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 3000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 4000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 5000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 6000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 7000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 8000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 9000000 lines
trainer_interface.cc(114) LOG(WARNING) Too many sentences are loaded! (9110889), which may slow down training.
trainer_interface.cc(116) LOG(WARNING) Consider using --input_sentence_size= and --shuffle_input_sentence=true.
trainer_interface.cc(119) LOG(WARNING) They allow to randomly sample sentences from the entire corpus.
trainer_interface.cc(315) LOG(INFO) Loaded all 9110889 sentences
trainer_interface.cc(321) LOG(INFO) Skipped 98 too long sentences.
trainer_interface.cc(330) LOG(INFO) Adding meta_piece:
trainer_interface.cc(330) LOG(INFO) Adding meta_piece:
trainer_interface.cc(330) LOG(INFO) Adding meta_piece:
trainer_interface.cc(335) LOG(INFO) Normalizing sentences…
trainer_interface.cc(384) LOG(INFO) all chars count=1331305324
trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=3502
trainer_interface.cc(403) LOG(INFO) Final character coverage=1
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 9110889 sentences.
unigram_model_trainer.cc(129) LOG(INFO) Making suffix array…

panosk · April 2, 2021, 8:03am

Hello,

Yes, it’s perfectly normal with so many sentences. Also note that at some point more cores are used (I think after making the suffix array).

chopinml · April 2, 2021, 10:41am

Thank you panosk,

Hopefully it will use multiple cores in the next phases.

It's been 14 hours and it is still in the "Making suffix array" phase. I will open a request in the library's repo; maybe they can improve the CPU threading, or use CUDA, etc.

panosk · April 2, 2021, 10:48am

This, however, is not normal. Do you have that much RAM, or is swap being used?

chopinml · April 2, 2021, 11:37am

I'm running on my laptop: 6GB is actual RAM, 23GB is swap.

I started with 8GB and then 16GB of swap, and both times the process was killed with errors.

Now it's running. [screenshot]

chopinml · April 3, 2021, 2:47pm

For information, the run finished today. It took more than 2 days. Since the script does not write timestamps I cannot give the exact time, but I can say it was at least 50 hours.
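
Since the tools themselves do not log timestamps, one way to get an exact duration next time is to launch the script through a small wrapper. A minimal sketch, where run_timed is a hypothetical helper name:

```python
import subprocess
import time
from datetime import datetime


def run_timed(command):
    """Run command, print start/end timestamps, return (exit_code, seconds)."""
    print(f"started:  {datetime.now().isoformat(timespec='seconds')}")
    start = time.monotonic()
    completed = subprocess.run(command, check=False)
    elapsed = time.monotonic() - start
    print(
        f"finished: {datetime.now().isoformat(timespec='seconds')} "
        f"({elapsed / 3600:.2f} h)"
    )
    return completed.returncode, elapsed


# Example (same invocation as earlier in this thread):
# run_timed(["./prepare_wmt_data.sh", "data/wmt"])
```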

The model and vocab files are only 798KB and 589KB.

You should add this information to the tutorial so people do not expect a quick run from the shell script, or maybe just provide these files along with a note like:

"You can run this script and wait 40-50 hours to build the model and vocab files from 9M sentences, or just download these two files and continue with training."

Thank you all for your help so far. I will continue with the training phase and see what error messages I run into next.

panosk · April 4, 2021, 8:55am

This is not the proper way to work with the script. Most example NMT training pipelines assume that the user has a strong machine that can handle the workload. Otherwise, the user should adapt the scripts to their use case, if possible, rather than run them blindly without checking them against their machine specs.
In this case, you should use the --input_sentence_size option in the sentencepiece section of the script to reduce the number of sentences fed to sentencepiece so that they fit in your available RAM:

spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/wmt$sl$tl \
           --vocab_size=$vocab_size --character_coverage=1 --input_sentence_size=1000000

Try to experiment to find a value that fits the sentences in your RAM without swapping; that way, training should take only a few minutes. With this lower value the sentencepiece model will probably not cover all subwords in the corpus, but at least the process should finish in a reasonable amount of time.
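
As a rough way to pick a starting value for --input_sentence_size, the sketch below scales the figures reported earlier in this thread (about 29GB for 9,110,889 sentences) linearly to the RAM you have. Linear scaling is an assumption, not a documented property of sentencepiece, and the function name is hypothetical, so treat the result as an upper bound to experiment below:

```python
def max_sentences_for_ram(available_gb, observed_sentences=9_110_889,
                          observed_gb=29.0):
    """Estimate how many sentences fit in available_gb.

    Assumes memory use grows linearly with the number of sentences,
    calibrated on the single observation reported in this thread.
    """
    if available_gb <= 0:
        raise ValueError("available_gb must be positive")
    return int(observed_sentences * available_gb / observed_gb)


# Example: estimate for a machine with 6GB of RAM.
# print(max_sentences_for_ram(6))
```

For a 6GB machine this gives roughly 1.9M sentences before any allowance for the operating system and other processes, so if the linear assumption holds, the 1,000,000 used in the command above leaves some margin.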

chopinml · April 4, 2021, 10:46am

Yes, I learned it the hard way.

Nothing about the required machine specs is written on the Tutorial page
https://opennmt.net/OpenNMT-py/examples/Translation.html

or on the Quickstart page
https://opennmt.net/OpenNMT-py/quickstart.html

or on the GitHub page
https://github.com/OpenNMT/OpenNMT-py/blob/master/examples/scripts/prepare_wmt_data.sh

#!/bin/bash

##################################################################################
# The default script downloads the commoncrawl, europarl and newstest2014 and
# newstest2017 datasets. Files that are not English or German are removed in
# this script for tidyness.You may switch datasets out depending on task.
# (Note that commoncrawl europarl-v7 are the same for all tasks).
# http://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz
# http://www.statmt.org/wmt13/training-parallel-europarl-v7.tgz
#
# WMT14 http://www.statmt.org/wmt14/training-parallel-nc-v9.tgz
# WMT15 http://www.statmt.org/wmt15/training-parallel-nc-v10.tgz
# WMT16 http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
# WMT17 http://data.statmt.org/wmt17/translation-task/training-parallel-nc-v12.tgz
# Note : there are very little difference, but each year added a few sentences
# new WMT17 http://data.statmt.org/wmt17/translation-task/rapid2016.tgz
#
# For WMT16 Rico Sennrich released some News back translation
# http://data.statmt.org/rsennrich/wmt16_backtranslations/en-de/
#

(file preview truncated)

Also, there is an "if false" block that has to be enabled for the shell script to work, as François said in the reply above.

You should mention these points somewhere on the website and modify the shell script, so that people have at least some idea of the real-world requirements.

francoishernandez · April 6, 2021, 8:02am

No doc is perfect. Feel free to PR some changes.

chopinml · April 8, 2021, 3:30pm

Yes, sure. I will add some notes to the quick start, for example:

Don't expect good results from 30K sentences; the translation output will look like "Das die, Hotel" etc., and that is normal.
Sentencepiece takes about 29GB of RAM for 9M sentences; you can create swap or change the spm_train call to the command below:
spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/wmt$sl$tl \
          --vocab_size=$vocab_size --character_coverage=1 --input_sentence_size=1000000