OpenNMT-py 2.0 release

hey @francoishernandez
I was looking at the sample generated before training begins, and I noticed one difference.
After tokenisation I am getting sentences like the ones below:
▁Per usal ▁of ▁the ▁file ▁shows ▁that ▁a ▁counter ▁affidavit ▁was ▁filed ▁by ▁respondent ▁No . ▁ 2 ▁under ▁index ▁dated ▁1 7 . ▁ 05 . ▁ 2000 , ▁wherein ▁the ▁impugned ▁order ▁dated ▁ 27 . ▁1 0 . ▁1 9 9 9 ▁was ▁sought ▁to ▁be ▁supported .
▁The ▁decision ▁in ▁R ▁v ▁Sp en cer 287 ▁( 20 1 4 ) ▁was ▁related ▁to ▁inform ational ▁privacy .
▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .

However, for the same corpus and sentencepiece model, if I use sp.encode or sp.encode_as_pieces I get tokenised text in the format below:
['▁Per', 'usal', '▁of', '▁the', '▁file', '▁shows', '▁that', '▁a', '▁counter', '▁affidavit', '▁was', '▁filed', '▁by', '▁respondent', '▁No', '.', '▁', '2', '▁under', '▁index', '▁dated', '▁1', '7', '.', '▁', '05', '.', '▁', '2000', ',', '▁wherein', '▁the', '▁impugned', '▁order', '▁dated', '▁', '27', '.', '▁1', '0', '.', '▁1', '9', '9', '9', '▁was', '▁sought', '▁to', '▁be', '▁supported', '.']
['▁The', '▁decision', '▁in', '▁R', '▁v', '▁Sp', 'en', 'cer', '287', '▁(', '20', '1', '4', ')', '▁was', '▁related', '▁to', '▁inform', 'ational', '▁privacy', '.']
['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter', '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']

What is the reason for this difference? I am referring to the sentencepiece Python wrapper page: sentencepiece/README.md at master · google/sentencepiece · GitHub

Previously I had been using the Python package of sentencepiece and calling the encode function, which gave me tokenised output in the list format shown above.

I don't understand your issue. It looks like the tokenization is the same; it's just a question of format, string vs list.
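For instance, joining the pieces with spaces gives you exactly the string shown in the sample. A minimal sketch, assuming your SentencePiece model is saved as spm.model (the path is a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.model")  # point this at your own trained model

text = "The further details of the chapter are not necessary for our purpose."
pieces = sp.encode_as_pieces(text)  # list form, e.g. ['▁The', '▁further', ...]
print(pieces)
print(" ".join(pieces))  # space-joined string, the format printed in the training sample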
Also, this has nothing to do with the original topic (OpenNMT-py 2.0 release). You might want to open another thread to discuss tokenization specific questions.

sure @francoishernandez
Thank you

Hi,

Any updates on source word features?

Someone on GitHub wanted to have a look, but I'm not sure they've made any progress yet.

Hi, I just want to double-check whether I need to run any preprocessing steps when using OpenNMT-py 2.0. I currently have two files of aligned parallel text, which I have used to develop a translation model. I have set up the YAML file to use onmt_tokenize and a well-performing model has been developed.

I did NOT perform any specific punctuation normalization, tokenization or truecasing. Are these all looked after automatically (i.e. by default) with the on-the-fly tokenization of OpenNMT-py 2.0? With the previous version of OpenNMT, would such steps have to be performed with separate Moses scripts (or similar scripts)? Thanks.

Are these all looked after automatically (i.e. by default) with the on-the-fly tokenization of OpenNMT-py 2.0?

It depends how you set it up in your config.

With the previous version of OpenNMT, would such steps have to be performed with separate Moses scripts (or similar scripts)?

BEFORE (legacy): preprocess takes already tokenized data, dumps it into shards, and dumps the associated vocab computed on the whole dataset:
[tokenized data] =(preprocessing)=> data_*.pt + vocab.pt

NOW: you can do as you wish, as long as you provide the training script with some data and a corresponding vocab.

  1. If you want to use the same (already processed) data as before, you can use your already tokenized data, and do not apply anything on the fly.
  2. If you want to tokenize on the fly, then you need to set the transforms you would like to apply in the config.

How you want your data to be handled in terms of tokenization, normalization, etc. is up to you.
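Roughly, the new flow for option 2 looks like this (command names per the 2.0 quickstart; config.yaml and the sample size are placeholders):

# build the vocab(s) from the raw data, applying the transforms declared in the config
onmt_build_vocab -config config.yaml -n_sample 10000
# then train directly; the transforms are applied on the fly
onmt_train -config config.yaml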

I have the following in my config:

corpus_1:
    path_src: data/src-train.txt
    path_tgt: data/tgt-train.txt
    transforms: [onmt_tokenize, filtertoolong]
    weight: 1
valid:
    path_src: data/src-val.txt
    path_tgt: data/tgt-val.txt
    transforms: [onmt_tokenize]

So I suppose the question is: what does [onmt_tokenize] do by default? Does it do punctuation normalization, tokenization and truecasing by default, or what do I need in the config for those steps? Thanks.

onmt_tokenize is configured in two ways:

  1. major subword-related options are flags at the onmt config level
  2. every other option should be included in the onmttok_kwargs dict string(s)

See this example.
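To make that concrete, a config sketch (the model paths are placeholders, and the kwargs shown are illustrative values, not defaults):

# 1. subword-related flags at the onmt config level
src_subword_type: sentencepiece
src_subword_model: path/to/spm.model
tgt_subword_type: sentencepiece
tgt_subword_model: path/to/spm.model
# 2. any other pyonmttok option goes into the kwargs dict strings
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"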

Looks good.

Thanks.

Hi @francoishernandez ,

It seems OpenNMT-tf has a transition guide here: 2.0 Transition Guide — OpenNMT-tf 2.18.1 documentation
Do we have anything similar in the documentation for OpenNMT-py as well?

Thanks and Regards
Megha Jain

No, there is no such thing yet.
Changes in 2.0 are mostly in the data loading pipeline. The rest (models, loss, training, etc.) is unchanged.
You can refer to the updated FAQ for examples of the new paradigm.

Thanks @francoishernandez