I have been trying to use on-the-fly tokenization with SentencePiece in OpenNMT-py. Following the OpenNMT-py tutorial page, I used this config:
```yaml
# Tokenization options
src_subword_type: sentencepiece
src_subword_model: path to the SP model
tgt_subword_type: sentencepiece
tgt_subword_model: path to the SP model
# Number of candidates for SentencePiece sampling
subword_nbest: 64
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

src_vocab: path to the SP vocab converted to ONMT format
tgt_vocab: path to the SP vocab converted to ONMT format
overwrite: False

data:
    corpus_1:
        path_src: path to the tokenized source file
        path_tgt: path to the tokenized target file
        transforms: [onmt_tokenize, filtertoolong]
    valid:
        path_src: path to the tokenized source file
        path_tgt: path to the tokenized target file
        transforms: [onmt_tokenize, filtertoolong]
```
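For reference, this is the sampling behaviour I expect, illustrated by calling the SentencePiece model directly with the standard `sentencepiece` Python API (a minimal sketch; `spm.model` is just a placeholder for the actual model path):

```python
import sentencepiece as spm

# "spm.model" is a placeholder for the actual SP model path from the config
sp = spm.SentencePieceProcessor(model_file="spm.model")

# With enable_sampling=True, nbest_size and alpha control subword
# regularization, so repeated calls can return different segmentations
for _ in range(3):
    print(sp.encode("Hello world", out_type=str,
                    enable_sampling=True, nbest_size=64, alpha=0.1))
```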
But during training, I see the following:

```
corpus_1's transforms: TransformPipe(ONMTTokenizerTransform(share_vocab=False, src_subword_kwargs={'sp_model_path': '…', 'sp_nbest_size': 1, 'sp_alpha': 0}, src_onmttok_kwargs={'mode': 'none', 'spacer_annotate': True}, tgt_subword_kwargs={'sp_model_path': '…', 'sp_nbest_size': 1, 'sp_alpha': 0}, tgt_onmttok_kwargs={'mode': 'none', 'spacer_annotate': True}), FilterTooLongTransform(src_seq_length=200, tgt_seq_length=200))
```
As can be seen, no subword regularization is performed at all: the transform reports `sp_nbest_size: 1` and `sp_alpha: 0` instead of the configured 64 and 0.1.
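For comparison, the same sampling options do take effect when passed directly to pyonmttok, which the `onmt_tokenize` transform wraps. This is a minimal sketch assuming the standard `pyonmttok.Tokenizer` API, with `spm.model` again a placeholder for the real path:

```python
import pyonmttok

# sp_nbest_size / sp_alpha correspond to the sampling options the
# config was meant to set; "spm.model" is a placeholder model path
tokenizer = pyonmttok.Tokenizer(
    "none",
    sp_model_path="spm.model",
    sp_nbest_size=64,
    sp_alpha=0.1,
    spacer_annotate=True,
)

# With sampling enabled, repeated calls can yield different tokenizations
for _ in range(3):
    tokens, _ = tokenizer.tokenize("Hello world")
    print(tokens)
```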
What went wrong?
Any help would be appreciated.
Thanks in advance.
Regards,
Mazida