hey @francoishernandez
I was looking at the sample generated before training begins, and I noticed one difference.
After tokenisation, I get sentences in the format below:
▁Per usal ▁of ▁the ▁file ▁shows ▁that ▁a ▁counter ▁affidavit ▁was ▁filed ▁by ▁respondent ▁No . ▁ 2 ▁under ▁index ▁dated ▁1 7 . ▁ 05 . ▁ 2000 , ▁wherein ▁the ▁impugned ▁order ▁dated ▁ 27 . ▁1 0 . ▁1 9 9 9 ▁was ▁sought ▁to ▁be ▁supported .
▁The ▁decision ▁in ▁R ▁v ▁Sp en cer 287 ▁( 20 1 4 ) ▁was ▁related ▁to ▁inform ational ▁privacy .
▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .
However, for the same corpus and SentencePiece model, if I use sp.encode or sp.encode_as_pieces, I get the tokenised text in the format below:
['▁Per', 'usal', '▁of', '▁the', '▁file', '▁shows', '▁that', '▁a', '▁counter', '▁affidavit', '▁was', '▁filed', '▁by', '▁respondent', '▁No', '.', '▁', '2', '▁under', '▁index', '▁dated', '▁1', '7', '.', '▁', '05', '.', '▁', '2000', ',', '▁wherein', '▁the', '▁impugned', '▁order', '▁dated', '▁', '27', '.', '▁1', '0', '.', '▁1', '9', '9', '9', '▁was', '▁sought', '▁to', '▁be', '▁supported', '.']
['▁The', '▁decision', '▁in', '▁R', '▁v', '▁Sp', 'en', 'cer', '287', '▁(', '20', '1', '4', ')', '▁was', '▁related', '▁to', '▁inform', 'ational', '▁privacy', '.']
['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter', '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']
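As a quick sanity check (plain Python, no SentencePiece model needed), joining the pieces of the third sentence with single spaces reproduces the first format exactly, so as far as I can tell the two outputs contain the same tokens and differ only in how they are printed:

```python
# Pieces as returned by sp.encode_as_pieces for the third sentence above.
pieces = ['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter',
          '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']

# Joining with spaces gives the string shown in the training sample log.
joined = ' '.join(pieces)
print(joined)
# → ▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .
```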
What is the reason for this difference? I am referring to the SentencePiece Python wrapper documentation (sentencepiece/README.md at master · google/sentencepiece · GitHub).
Previously I have been using the Python package of SentencePiece and calling its encode function, which gave me tokenised output in the list format shown above.