Doubts about the SentencePiece encoded/decoded format in OpenNMT-py v2

Hi all,
I have been trying the latest version of OpenNMT-py for NMT training.
I have a couple of doubts about the format in which SentencePiece BPE sentences are fed in for the model training and inference steps. (It may not be directly related to OpenNMT-py, but any clarification would help.)

Basically, I am using the sentencepiece Python package (installed via pip) to train the SentencePiece model.
When I dump a sample of the data that goes into the Transformer-based NMT training in OpenNMT-py (using n_sample, following the Translation — OpenNMT-py documentation), I see the samples in the format below:

▁The ▁decision ▁in ▁R ▁v ▁Sp en cer 287 ▁( 20 1 4 ) ▁was ▁related ▁to ▁inform ational ▁privacy .
▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .

but when I externally encode the same sentences using the same SentencePiece model, I get the output in the format below (i.e., a list of subword string tokens):
['▁The', '▁decision', '▁in', '▁R', '▁v', '▁Sp', 'en', 'cer', '287', '▁(', '20', '1', '4', ')', '▁was', '▁related', '▁to', '▁inform', 'ational', '▁privacy', '.']
['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter', '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']

The tokenisation is the same; only the format varies (a list vs. a space-joined string).
But this is causing issues in translation.

So, once the model is trained, I encode my test sentences using the same SentencePiece Python package (calling encode_as_pieces on the SentencePieceProcessor), which again gives me the output as a list of subword tokens. I then join these with spaces to get a format similar to the OpenNMT training samples (▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .).
The translation output is again in this format (▁আর ভি ▁রবীন্দ্র ন ▁ও ▁কেএস ▁রাধাকৃষ্ণান).
To detokenise, I again have to split it into a list of subword tokens (separated by spaces) and call SentencePiece's DecodePieces function.
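
For reference, here is a minimal sketch of the round trip I am doing (the model path and sentences are illustrative):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")  # illustrative path

# encode: list of subword pieces -> space-joined line for OpenNMT-py
pieces = sp.encode_as_pieces("The further details of the chapter are not necessary for our purpose.")
line_for_onmt = " ".join(pieces)

# after translation: split the hypothesis back into pieces and decode
hypothesis = line_for_onmt  # in reality this is the model's output line
detok = sp.decode_pieces(hypothesis.split(" "))
print(detok)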

But what is happening is that lots of ▁ spacer characters are left in the decoded sentence (আরভি রবীন্দ্রন ও কেএস▁রাধাকৃষ্ণান).
Also, I am not sure this is the right way; am I missing something? A lot of SentencePiece-specific pre- and post-processing has to be done to arrive at a particular format, whereas it was much simpler in the OpenNMT quickstart example.

I also wanted to add that the ctranslate2.translate_batch function likewise accepts "a list of lists of string tokens",
so the same manipulations have to be made there as well. Am I doing something wrong, or is this expected?
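
For example, something like this (a sketch; the model directory, file names, and the result attributes are assumptions based on recent CTranslate2 versions):

import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")  # illustrative path
translator = ctranslate2.Translator("ct2_model")         # illustrative path

# translate_batch expects a list of token lists, so each sentence must be encoded first
batch = [sp.encode_as_pieces(s) for s in ["The decision was related to informational privacy."]]
results = translator.translate_batch(batch)

# each result carries token-level hypotheses that must be decoded back to text
for result in results:
    print(sp.decode_pieces(result.hypotheses[0]))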

Hello!

Working with Asian languages is slightly tricky. If you find that something does not work well with SentencePiece, maybe it is a good opportunity to report the issue on their GitHub repository.

However, there are two things you can try first:

1- You can revise your scripts against mine here.

Note that for Asian languages, I use filter_without_tokenization.py, because if you need to tokenize first, you have to use a library that supports Indic languages, such as iNLTK, the Indic NLP Library, or StanfordNLP's Stanza. However, it is still okay to use SentencePiece directly without pre-tokenization.

2- You can try to use OpenNMT-py transforms on the fly.

  • Train a SentencePiece subword model. In my experiments for Hindi, I found BPE gives better results than the default Unigram model.
  • Use your subword model in the YAML file; I will give an example below.
  • After translation, use the target model to decode the output.
# Training files
data:
    corpus_1:
        path_src: data/train.en
        path_tgt: data/train.hi
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: data/dev.en
        path_tgt: data/dev.hi
        transforms: [sentencepiece, filtertoolong]


# Tokenization options
src_subword_model: subword/bpe/en.model
tgt_subword_model: subword/bpe/hi.model


# Vocabulary files
src_vocab: run/enhi.vocab.src
tgt_vocab: run/enhi.vocab.tgt


early_stopping: 4
log_file: logs/train.bpe.log

save_model: models/model.enhi

...
...
...
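
For steps 1 and 3, here is a minimal sketch using the sentencepiece Python package (the vocab size and file names are placeholders chosen to match the config above):

import sentencepiece as spm

# Step 1: train BPE subword models for source and target
for lang in ("en", "hi"):
    spm.SentencePieceTrainer.train(
        input=f"data/train.{lang}",
        model_prefix=f"subword/bpe/{lang}",
        model_type="bpe",    # BPE rather than the default unigram model
        vocab_size=32000,    # placeholder size
    )

# Step 3: after translation, decode the target-side output
sp = spm.SentencePieceProcessor(model_file="subword/bpe/hi.model")
with open("output.hi.sp", encoding="utf-8") as f:  # illustrative output file
    for line in f:
        print(sp.decode_pieces(line.strip().split(" ")))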

Is it possible to mix a SentencePiece subword model with onmt_tokenize as the on-the-fly transform?

The SentencePiece model (spm.model) and vocabs were built successfully using sentencepiece. If I use transforms: [sentencepiece], training throws an error. However, if I use transforms: [onmt_tokenize], the training step works.

The following is my config:

# Where the vocab(s) will be written
src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt

# Tokenisation options
src_subword_type: sentencepiece
src_subword_model: data/spm.model
tgt_subword_type: sentencepiece
tgt_subword_model: data/spm.model

# SentencePiece sampling candidates
subword_nbest: 20

# Smoothing for SentencePiece sampling
subword_alpha: 0.1

# Arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

# Corpus opts
data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [onmt_tokenize]
        weight: 1
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [onmt_tokenize]

the training of a model throws an error

Not sure why. You might want to post the full error trace.

Yes, you can use onmt_tokenize to tokenize with a sentencepiece model. See this example.
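
For reference, the onmt_tokenize transform is backed by pyonmttok, which can load a SentencePiece model directly (a small sketch with an illustrative model path):

import pyonmttok

# mode "none" plus a SentencePiece model gives plain SentencePiece tokenization
tokenizer = pyonmttok.Tokenizer("none", sp_model_path="data/spm.model")
tokens, _ = tokenizer.tokenize("The further details of the chapter are not necessary.")
print(tokens)                        # list of subword pieces
print(tokenizer.detokenize(tokens))  # round-trip back to plain text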

Thanks, Francois. That’s good to know about onmt_tokenize to tokenize with a sentencepiece model. That will work for me.