Very nice to see that there is still a lot of active development! We will not be upgrading soon, as source word features are a feature we use extensively, but I do appreciate the continuing work on the library!
Thanks Bram for your feedback!
We will most probably put back source feature support at some point.
Hi,
You mentioned source word features were dropped from this version, at least temporarily. Do they imply an issue on the fly? When do you expect to add them?
Thanks
Do they imply an issue on the fly?
Not sure what you mean, here. No particular issue, except that you can’t use them for now.
If you’re asking about why it was dropped, it’s because it requires some adaptations in the new dynamic inputters pipeline, that we didn’t get to yet.
It should not be particularly difficult, just requires a bit of time and testing.
I think the main remaining topic is the vocab building of the features field(s). (The `_feature_tokenize` stuff is actually still there, but won’t work without the proper adaptations upstream.)
Feel free to contribute if you feel like it.
Hello All, @francoishernandez
I used to train using OpenNMT-py (the last version I used was 0.9.2); I see lots of updates have been made since then.
I have a couple of doubts and points of confusion.
My previous pipeline used to be like this:
- Apply BPE on the source language and target language, using sentencepiece models trained outside of OpenNMT.
- Run preprocess.py:
```
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/procData_2019
```
- Run train.py:
```
python train.py -data data/procData_2019 -save_model model/model_2019-model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 100000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 0.25 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -world_size 1 -gpu_ranks 0
```
I can see there is no preprocess.py now (is build_vocab.py the equivalent?).
Can anyone tell me how exactly to achieve the above in the new release?
Hey @ajitesh3
You might want to have a look at the updated documentation, especially:
- Quickstart for the basic principles;
- Translation example;
- On the fly tokenization.
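For anyone mapping the old preprocess.py flow onto 2.0: roughly, a single yaml config replaces the preprocess step, consumed by the `onmt_build_vocab` and `onmt_train` entry points. A minimal sketch with illustrative paths (not a definitive config, check the linked docs for the full option set):

```yaml
# config.yaml -- paths are illustrative
save_data: data/run/example
src_vocab: data/run/example.vocab.src
tgt_vocab: data/run/example.vocab.tgt
data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
```

Then something like `onmt_build_vocab -config config.yaml -n_sample 10000` to build vocab(s), followed by `onmt_train -config config.yaml` with the usual model/optim options added to the same yaml.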
Thanks @francoishernandez for replying.
Most of it I understood from the above docs. However, a few doubts remain:
- In the original sentencepiece method there are two options for training the model, BPE or unigram (model_type). With the transform option, is it only for a BPE-based sentencepiece model, or how do we specify the type? Also, where do we specify the vocab_size of the sentencepiece model, and how do we define user_defined_symbols in this case?
- In the FAQ, On-the-fly tokenization, it’s written as:
Tokenization options:
```yaml
src_subword_type: sentencepiece
src_subword_model: examples/subword.spm.model
```
however, in the Translation example it’s written as below:
```yaml
# Corpus opts:
data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
```
Where exactly do we define the sentencepiece arguments?
Training and using a sentencepiece model are two different things.
The new features in 2.0 are only for using such a model. You still need to train your sentencepiece model beforehand.
The tokenization options are listed here.
Also, the Translation example has a sentencepiece configuration with the aforementioned options.
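Putting the two doc snippets together, a config using the sentencepiece transform might look like this (model path reused from the FAQ snippet above, purely illustrative):

```yaml
# Sketch: transform names go under each corpus entry;
# the subword options sit at the top level of the same config.
data:
    commoncrawl:
        path_src: data/wmt/commoncrawl.de-en.en
        path_tgt: data/wmt/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]

# options consumed by the sentencepiece transform
src_subword_model: examples/subword.spm.model
tgt_subword_model: examples/subword.spm.model
```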
Ok, I am trying to make it work for me. Also, I am currently using OpenNMT version 0.9.2 (basically I forked off back then). Should I pull the latest and continue after resolving conflicts by accepting incoming changes? That should be okay, right? Or is it highly advisable to clone afresh? (PS: I have added lots of changes in the forked repo.)
Only you can know the extent of the changes in your fork. If you want to merge the 2.0 changes you will need to look carefully at the various changes. It probably won’t just be some simple conflict resolutions.
All the major structural changes happened here for reference:
Also, it will depend on which parts you made changes to. For instance, most changes happened in preprocessing / training parts, but everything related to inference, models, optimizers, etc. did not change.
In the OpenNMT context I made changes only to preprocessing (I added a word tokenizer and a sentencepiece tokenizer as part of the preprocessing pipeline).
For training I was directly using python train.py with all the arguments. I checked the config yaml; I believe it’s the same, the only difference being that the options now live in the config yaml, whereas previously they were passed as arguments to train.py.
In the OpenNMT context I made changes only to preprocessing (I added a word tokenizer and a sentencepiece tokenizer as part of the preprocessing pipeline).
Then you can probably just drop your changes and start anew from 2.0, as all this can now be done on the fly when training.
@francoishernandez
Is there a way to see the transformed corpus after the tokenisation step?
It should be saved in save_data: data/wmt/run/example
However, no such file is getting generated for me; training starts directly.
I am having a weird problem while merging my branch with the latest OpenNMT master: a few files are missing after merging and resolving conflicts, e.g. onmt.utils.earlystopping.
Is there a way to see the transformed corpus after the tokenisation step?
It should be saved in save_data: data/wmt/run/example
However, no such file is getting generated for me; training starts directly.
Have a look at the n_sample opt.
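If I read the docs right, n_sample goes in the same yaml config (value illustrative); with a positive value, a sample of the transformed corpus should be dumped under save_data so you can inspect the tokenized output:

```yaml
save_data: data/wmt/run/example
# dump this many transformed examples under save_data for inspection;
# onmt_build_vocab -config config.yaml -n_sample 10000 also produces them
n_sample: 10000
```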
I am having a weird problem while merging my branch with the latest OpenNMT master: a few files are missing after merging and resolving conflicts, e.g. onmt.utils.earlystopping.
Not sure I understand your issue here. The file you mention, for instance, has not been modified in two years, so it’s ‘normal’ that it’s not affected by any merge. Anyway, this is more of a git issue than an OpenNMT one.
In the data: commoncrawl section, the transforms list mentions sentencepiece tokenisation.
What does src_subword_type refer to then, which is mentioned as a separate key? Is it only for the pyonmttok-based model?
I am using a sentencepiece (BPE) model (trained from source); do I need to mention that type in both places?
One more doubt: when a model is trained using the subword-nmt module, is it referred to as BPE only?
And when trained using sentencepiece, is the type sentencepiece (with no distinction between sentencepiece BPE and unigram)?
src_subword_type is indeed only for the OpenNMT Tokenizer / onmt_tokenize transform.
There are two different things here:
- the type of model (sentencepiece or bpe);
- the tool you want to use (onmt_tokenize, sentencepiece, bpe).
BPE here refers to subword-nmt style BPE (the model is in fact a list of merge operations in a plain text file), whereas a sentencepiece (BPE) model will be in the sentencepiece format.
If you have a BPE model, you can use either the bpe or the onmt_tokenize transform.
If you have a sentencepiece model, you can use either the sentencepiece or the onmt_tokenize transform.
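Concretely, for the same sentencepiece model the two routes might look like this (model path reused from the FAQ snippet earlier, purely illustrative); note that src_subword_type only matters on the onmt_tokenize route:

```yaml
# Route 1: the dedicated sentencepiece transform
transforms: [sentencepiece]
src_subword_model: examples/subword.spm.model
tgt_subword_model: examples/subword.spm.model

# Route 2: the OpenNMT Tokenizer transform wrapping the same model
# transforms: [onmt_tokenize]
# src_subword_type: sentencepiece
# src_subword_model: examples/subword.spm.model
```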
hey @francoishernandez
I was looking at the sample generated before training begins. I could see one difference.
After tokenisation I am getting sentences like the ones below:
▁Per usal ▁of ▁the ▁file ▁shows ▁that ▁a ▁counter ▁affidavit ▁was ▁filed ▁by ▁respondent ▁No . ▁ 2 ▁under ▁index ▁dated ▁1 7 . ▁ 05 . ▁ 2000 , ▁wherein ▁the ▁impugned ▁order ▁dated ▁ 27 . ▁1 0 . ▁1 9 9 9 ▁was ▁sought ▁to ▁be ▁supported .
▁The ▁decision ▁in ▁R ▁v ▁Sp en cer 287 ▁( 20 1 4 ) ▁was ▁related ▁to ▁inform ational ▁privacy .
▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .
However, for the same corpus and sentencepiece model, if I use sp.encode or sp.encode_as_pieces I get the tokenised text in the format below:
```python
['▁Per', 'usal', '▁of', '▁the', '▁file', '▁shows', '▁that', '▁a', '▁counter', '▁affidavit', '▁was', '▁filed', '▁by', '▁respondent', '▁No', '.', '▁', '2', '▁under', '▁index', '▁dated', '▁1', '7', '.', '▁', '05', '.', '▁', '2000', ',', '▁wherein', '▁the', '▁impugned', '▁order', '▁dated', '▁', '27', '.', '▁1', '0', '.', '▁1', '9', '9', '9', '▁was', '▁sought', '▁to', '▁be', '▁supported', '.']
['▁The', '▁decision', '▁in', '▁R', '▁v', '▁Sp', 'en', 'cer', '287', '▁(', '20', '1', '4', ')', '▁was', '▁related', '▁to', '▁inform', 'ational', '▁privacy', '.']
['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter', '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']
```
What is the reason for this difference? I am referring to the sentencepiece python wrapper page: sentencepiece/README.md at master · google/sentencepiece · GitHub
Previously I had been using the python package of sentencepiece and called the encode function, which gave me tokenised output in the list format shown above.
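For what it's worth, the two outputs above appear to contain the same pieces, just rendered differently: encode_as_pieces returns a Python list, while the dumped sample file shows the same pieces joined with spaces. If that holds, there is no tokenization difference, only a display difference. A quick sanity check using pieces copied from the example above:

```python
# Pieces as returned by sp.encode_as_pieces (a Python list of strings)
pieces = ['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter',
          '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']

# The on-disk sample line is just the pieces joined by spaces
dumped_line = " ".join(pieces)
print(dumped_line)

# Splitting the dumped line on spaces recovers the original list,
# since the pieces themselves contain '▁' rather than literal spaces
assert dumped_line.split(" ") == pieces
```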