BPE_dropout training

mazida · January 17, 2023, 10:48am

Hello…
I have been trying to use BPE-dropout during training with on-the-fly tokenization. Training looks good with the validation ppl dropping. But, on testing (during translate) the output it generates has a mix of source and target language, with majority tokens belonging to the source language rather than the target language.
For example, for a English->Assamese model ,
if the src sent (BPE applied) is :
" farming is the primary occu@@ pation of man ."

the machine output is:
“farming @ @ farming @ @ farming @ @ farming @ @ farming @ @ farming @ @ occu@ @ occu@ @ occu@ @ occu@ @ occu@ @ occu@ @ farming @ @ occu@ @ occu@ @ occu@ @ occu@ @ the @ @ farming @ @ the @ @ farming @ @ া য়।”

target output (with BPE applied) should be:
“কৃষি মানুহৰ এক প্ৰধান জীৱ@@ িকা ।”

I cannot figure out the error…

I used the following in the train.yaml file:

Tokenization options

src_subword_type: bpe
src_subword_model: path to the source language BPE codes file
tgt_subword_type: bpe
tgt_subword_model: path to the target language BPE codes file

Specific arguments for pyonmttok

src_onmttok_kwargs: “{‘mode’: ‘none’, ‘spacer_annotate’: True}”
tgt_onmttok_kwargs: “{‘mode’: ‘none’, ‘spacer_annotate’: True}”

#smoothing parameter for sentencepiece regularization / dropout probability for BPE, source side
src_subword_alpha: 0.1

#smoothing parameter for sentencepiece regularization / dropout probability for BPE, target side
tgt_subword_alpha: 0.1

For translating I had used the standard BPE (no dropout).

The OpenNMT-py documentattion provides the procedure for Sentencepiece, but cannot find any direct tutorial for BPE-Dropout.
Please Help…I have been trying desperately since the past three days.

Regards,
Mazida.

francoishernandez · January 26, 2023, 8:44pm

“farming @ @ farming @ @ farming @ @ farming @ @ farming @ @ farming @ @ occu@ @ occu@ @ occu@ @ occu@ @ occu@ @ occu@ @ farming @ @ occu@ @ occu@ @ occu@ @ occu@ @ the @ @ farming @ @ the @ @ farming @ @ া য়।”

This looks like your data is not properly tokenized (tokenized too much actually).
Also, using onmt_tokenize with spacer_annotate, you should not have @@ joiners, but ▁ spacers. So either your data is pre-tokenized with subword-nmt, or maybe you set the bpe transform instead of onmt_tokenize.

Basically, your tokenization options are not right. You might want to try and find the proper options you really want to use by looking at the Tokenizer docs, and then adapt your config accordingly.

github.com

OpenNMT/Tokenizer/blob/master/docs/options.md

# Tokenization options

This file documents the options of the Tokenizer interface which can be used in:

* command line client
* C++ API
* Python API

*The exact name format of each option may be different depending on the API used.*

**Terminology:**

* *joiner*: special character indicating that the surrounding tokens should be merged when detokenized
* *spacer*: special character indicating that a space should be introduced when detokenized
* *placeholder* (or *protected sequence*): sequence of characters delimited by ｟ and ｠ that should not be segmented

**Table of contents:**

1. [General](#general)
1. [Case annotation](#case-annotation)

This file has been truncated. show original

Also, general comment, if you want to maximize your chance to get help, post your FULL config.

mazida · February 1, 2023, 7:01am

Thank you for your response.

mazida · February 1, 2023, 8:08am

Yes, you are right…I had pre-tokenised with subword-nmt.

I am able to figure out the problem now:

Training config:

Tokenization options

src_subword_type: bpe
src_subword_model: path to source BPE codes
tgt_subword_type: bpe
tgt_subword_model: path to target BPE codes

src_onmttok_kwargs: “{‘mode’: ‘none’, ‘spacer_annotate’: True}”
tgt_onmttok_kwargs: “{‘mode’: ‘none’, ‘spacer_annotate’: True}”

src_subword_alpha: 0.1
tgt_subword_alpha: 0.1

src_vocab: path to source vocab
tgt_vocab: path to target vocab
overwrite: False

src_seq_length: 200
tgt_seq_length: 200

data:

corpus_1:
  path_src: path to RAW source file 
  path_tgt: path to RAW target file
  transforms: [onmt_tokenize, filtertoolong]
valid:
  path_src: path to RAW source file
  path_tgt: path to RAW target file
  transforms: [onmt_tokenize, filtertoolong]

Inference

Problem: I had pretokenised the TEST file with subword-nmt which gave me the output with @@ as joiner → doesn’t seem to work.

Questions:

Is it correct to use ‘spacer_annotate’ while training with BPE_dropout or should we go with ‘joiner_annotate’ ?
How should I use onmt-tokenizer during inference? I did try (for the very first time, I am familiar with using subword-nmt) as follows:

import pyonmttok
tokenizer = pyonmttok.Tokenizer(“conservative”, bpe_model_path=…,
bpe_vocab_path=…, \
bpe_vocab_threshold= 1,
spacer_annotate=True)

But doesn’t seem to work.