OpenNMT-py on-the-fly tokenization with subword regularization

I have been trying to set up on-the-fly tokenization with SentencePiece in OpenNMT-py. Following the OpenNMT-py tutorial page, I used the following config:

```yaml
# Tokenization options
src_subword_type: sentencepiece
src_subword_model: path to the SP model
tgt_subword_type: sentencepiece
tgt_subword_model: SP model path

# Number of candidates for SentencePiece sampling
subword_nbest: 64

# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1

# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

src_vocab: path to the SP vocab converted to ONMT format
tgt_vocab: path to the SP vocab converted to ONMT format
overwrite: False

data:
    corpus_1:
        path_src: Tokenized file path
        path_tgt: Tokenized file path
        transforms: [onmt_tokenize, filtertoolong]
    valid:
        path_src: Tokenized file path
        path_tgt: Tokenized file path
        transforms: [onmt_tokenize, filtertoolong]
```


But during training, I see the following:

```
corpus_1's transforms: TransformPipe(ONMTTokenizerTransform(share_vocab=False, src_subword_kwargs={'sp_model_path': '…', 'sp_nbest_size': 1, 'sp_alpha': 0}, src_onmttok_kwargs={'mode': 'none', 'spacer_annotate': True}, tgt_subword_kwargs={'sp_model_path': '…', 'sp_nbest_size': 1, 'sp_alpha': 0}, tgt_onmttok_kwargs={'mode': 'none', 'spacer_annotate': True}), FilterTooLongTransform(src_seq_length=200, tgt_seq_length=200))
```

As can be seen from `sp_nbest_size: 1` and `sp_alpha: 0`, it does not perform any subword regularization at all!
What went wrong?
Any help would be appreciated.

Thanks in advance,

Regards,
Mazida

I think the docs are not fully up to date here.
We expect the sided `{src,tgt}_subword_alpha` // `{src,tgt}_subword_nbest` opts instead of the "non-sided" `subword_alpha` // `subword_nbest`.
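In other words, the sampling options in the config above would need to be replaced with their sided variants. A sketch of that change, reusing the same values as the original config:

```yaml
# Sided sampling options replacing the "non-sided" subword_nbest / subword_alpha
src_subword_nbest: 64
src_subword_alpha: 0.1
tgt_subword_nbest: 64
tgt_subword_alpha: 0.1
```

With these set, the logged transform should report the sampled values (e.g. `sp_nbest_size: 64, sp_alpha: 0.1`) instead of the defaults `1` and `0`.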