Very nice to see that there is still a lot of active development! We will not be upgrading soon, as source word features are a feature we use extensively, but I do appreciate the continuing work on the library!
Thanks Bram for your feedback!
We will most probably put back source feature support at some point.
Hi,
You mentioned source word features were dropped from this version, at least temporarily. Do they imply an issue on the fly? When do you expect to add them?
Thanks
Do they imply an issue on the fly?
Not sure what you mean, here. No particular issue, except that you can’t use them for now.
If you’re asking about why it was dropped, it’s because it requires some adaptations in the new dynamic inputters pipeline, that we didn’t get to yet.
It should not be particularly difficult, just requires a bit of time and testing.
I think the main remaining topic is the vocab building of the features field(s). (The `_feature_tokenize` stuff is actually still there, but won’t work without the proper adaptations upstream.)
Feel free to contribute if you feel like it.
Hello All, @francoishernandez
I used to train using OpenNMT-py (the last version I used was 0.9.2); I see lots of updates have been made since then.
I have a couple of doubts and points of confusion.
My previous pipeline used to be like this:
- Apply BPE on the source language and target language, using sentencepiece models trained outside of OpenNMT.
- Run preprocess.py:
```
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/procData_2019
```
- Run train.py:
```
python train.py -data data/procData_2019 -save_model model/model_2019-model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 100000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 0.25 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -world_size 1 -gpu_ranks 0
```
I can see there is no preprocess.py now (is build_vocab.py the equivalent?).
Can anyone tell me how exactly to achieve the above in the new release?
Hey @ajitesh3
You might want to have a look at the updated documentation, especially:
- Quickstart for the basic principles;
- Translation example;
- On the fly tokenization.
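For anyone mapping the old preprocess.py flow onto 2.0: roughly, a single yaml config replaces the preprocess step, consumed by the `onmt_build_vocab` and `onmt_train` entry points. A minimal sketch with illustrative paths (not a definitive config, check the linked docs for the full option set):

```yaml
# config.yaml -- paths are illustrative
save_data: data/run/example
src_vocab: data/run/example.vocab.src
tgt_vocab: data/run/example.vocab.tgt
data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
```

Then something like `onmt_build_vocab -config config.yaml -n_sample 10000` to build vocab(s), followed by `onmt_train -config config.yaml` with the usual model/optim options added to the same yaml.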
Thanks @francoishernandez for replying.
Most of it I understood from the above docs. However, a few doubts remain:
- In the original sentencepiece method there are two options for training the model, BPE or unigram (model_type). With the transform option, is it only for a BPE-based sentencepiece model, or how do we specify the type? Also, where do we specify the vocab_size of the sentencepiece model, and how do we define user_defined_symbols in this case?
- In the FAQ, On-the-fly tokenization, it’s written as:
Tokenization options:
```yaml
src_subword_type: sentencepiece
src_subword_model: examples/subword.spm.model
```
however, in the Translation example it’s written as below:
```yaml
# Corpus opts:
data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
```
Where exactly do we define the sentencepiece arguments?
Training and using a sentencepiece model are two different things.
The new features in 2.0 are only for using such a model. You still need to train your sentencepiece model beforehand.
The tokenization options are listed here.
Also, the Translation example has a sentencepiece configuration with the aforementioned options.
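Putting the two doc snippets together, a config using the sentencepiece transform might look like this (model path reused from the FAQ snippet above, purely illustrative):

```yaml
# Sketch: transform names go under each corpus entry;
# the subword options sit at the top level of the same config.
data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]

# options consumed by the sentencepiece transform
src_subword_model: examples/subword.spm.model
tgt_subword_model: examples/subword.spm.model
```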
Ok, I am trying to make it work for me. Also, I am currently using OpenNMT version 0.9.2 (basically I forked off back then). Should I pull the latest and continue after resolving conflicts by accepting incoming changes? That should be okay, right? Or is it highly advisable to clone afresh? (PS: I have added lots of changes in the forked repo.)
Only you can know the extent of the changes in your fork. If you want to merge the 2.0 changes you will need to look carefully at the various changes. It probably won’t just be some simple conflict resolutions.
All the major structural changes happened here for reference:
Also, it will depend on which parts you made changes to. For instance, most changes happened in preprocessing / training parts, but everything related to inference, models, optimizers, etc. did not change.
In the OpenNMT context I made changes only to preprocessing (I added a word tokenizer and a sentencepiece tokenizer as part of the preprocessing pipeline).
For training I was directly using python train.py with all the arguments. I checked the config yaml; I believe it’s the same, the only difference being that the options now live in the config yaml, whereas previously they were passed as arguments to train.py.
In the OpenNMT context I made changes only to preprocessing (I added a word tokenizer and a sentencepiece tokenizer as part of the preprocessing pipeline).
Then you can probably just drop your changes and start anew from 2.0, as all this can now be done on the fly when training.
@francoishernandez
Is there a way to see the transformed corpus after the tokenisation step?
It should be saved in save_data: data/wmt/run/example
However, no such file is getting generated for me; training starts directly.
I am having a weird problem while merging my branch with the latest OpenNMT master: a few files are missing after merging and resolving conflicts, e.g. onmt.utils.earlystopping.
Is there a way to see the transformed corpus after the tokenisation step?
It should be saved in save_data: data/wmt/run/example
However, no such file is getting generated for me; training starts directly.
Have a look at the n_sample opt.
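If I read the docs right, n_sample goes in the same yaml config (value illustrative); with a positive value, a sample of the transformed corpus should be dumped under save_data so you can inspect the tokenized output:

```yaml
save_data: data/wmt/run/example
# dump this many transformed examples under save_data for inspection;
# onmt_build_vocab -config config.yaml -n_sample 10000 also produces them
n_sample: 10000
```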
I am having a weird problem while merging my branch with the latest OpenNMT master: a few files are missing after merging and resolving conflicts, e.g. onmt.utils.earlystopping.
Not sure I understand your issue here. The file you mention, for instance, has not been modified in two years, so it’s ‘normal’ that it’s not affected by any merge. Anyway, this is more of a git issue than an OpenNMT one.
In the data: commoncrawl section, the transforms list mentions sentencepiece tokenisation.
What does src_subword_type refer to then, which is mentioned as a separate key? Is it only for the pyonmttok-based model?
I am using a sentencepiece (BPE) model (trained from source); do I need to mention that type in both places?
One more doubt: when a model is trained using the subword-nmt module, is it referred to as BPE only?
And when trained using sentencepiece, is the type sentencepiece (with no distinction between sentencepiece BPE and unigram)?
src_subword_type is indeed only for the OpenNMT Tokenizer / onmt_tokenize transform.
There are two different things here:
- the type of model (sentencepiece or bpe);
- the tool you want to use (onmt_tokenize, sentencepiece, bpe).
BPE here refers to subword-nmt style BPE (the model is in fact a list of merge operations in a plain text file), whereas a sentencepiece (BPE) model will be in the sentencepiece format.
If you have a BPE model, you can use either the bpe or the onmt_tokenize transform.
If you have a sentencepiece model, you can use either the sentencepiece or the onmt_tokenize transform.
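Concretely, for the same sentencepiece model the two routes might look like this (model path reused from the FAQ snippet earlier, purely illustrative); note that src_subword_type only matters on the onmt_tokenize route:

```yaml
# Route 1: the dedicated sentencepiece transform
transforms: [sentencepiece]
src_subword_model: examples/subword.spm.model
tgt_subword_model: examples/subword.spm.model

# Route 2: the OpenNMT Tokenizer transform wrapping the same model
# transforms: [onmt_tokenize]
# src_subword_type: sentencepiece
# src_subword_model: examples/subword.spm.model
```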
hey @francoishernandez
I was looking at the sample generated before training begins. I could see one difference.
After tokenisation I am getting sentences like the ones below:
▁Per usal ▁of ▁the ▁file ▁shows ▁that ▁a ▁counter ▁affidavit ▁was ▁filed ▁by ▁respondent ▁No . ▁ 2 ▁under ▁index ▁dated ▁1 7 . ▁ 05 . ▁ 2000 , ▁wherein ▁the ▁impugned ▁order ▁dated ▁ 27 . ▁1 0 . ▁1 9 9 9 ▁was ▁sought ▁to ▁be ▁supported .
▁The ▁decision ▁in ▁R ▁v ▁Sp en cer 287 ▁( 20 1 4 ) ▁was ▁related ▁to ▁inform ational ▁privacy .
▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .
However, for the same corpus and sentencepiece model, if I use sp.encode or sp.encode_as_pieces I get the tokenised text in the format below:
```python
['▁Per', 'usal', '▁of', '▁the', '▁file', '▁shows', '▁that', '▁a', '▁counter', '▁affidavit', '▁was', '▁filed', '▁by', '▁respondent', '▁No', '.', '▁', '2', '▁under', '▁index', '▁dated', '▁1', '7', '.', '▁', '05', '.', '▁', '2000', ',', '▁wherein', '▁the', '▁impugned', '▁order', '▁dated', '▁', '27', '.', '▁1', '0', '.', '▁1', '9', '9', '9', '▁was', '▁sought', '▁to', '▁be', '▁supported', '.']
['▁The', '▁decision', '▁in', '▁R', '▁v', '▁Sp', 'en', 'cer', '287', '▁(', '20', '1', '4', ')', '▁was', '▁related', '▁to', '▁inform', 'ational', '▁privacy', '.']
['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter', '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']
```
What is the reason for this difference? I am referring to the sentencepiece python wrapper page: sentencepiece/README.md at master · google/sentencepiece · GitHub
Previously I had been using the python package of sentencepiece and called the encode function, which gave me tokenised output in the list format shown above.
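For what it's worth, the two outputs above appear to contain the same pieces, just rendered differently: encode_as_pieces returns a Python list, while the dumped sample file shows the same pieces joined with spaces. If that holds, there is no tokenization difference, only a display difference. A quick sanity check using pieces copied from the example above:

```python
# Pieces as returned by sp.encode_as_pieces (a Python list of strings)
pieces = ['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter',
          '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']

# The on-disk sample line is just the pieces joined by spaces
dumped_line = " ".join(pieces)
print(dumped_line)

# Splitting the dumped line on spaces recovers the original list,
# since the pieces themselves contain '▁' rather than literal spaces
assert dumped_line.split(" ") == pieces
```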