Hi all,
I have been trying the latest version of OpenNMT-py for NMT training.
I have a couple of doubts about the format in which SentencePiece BPE sentences are fed in for model training and at inference time. (It might not be directly related to OpenNMT-py, but any clarification would help.)
Basically, I am using the sentencepiece Python package (pip) to train the SentencePiece model.
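For reference, this is roughly how I train the SentencePiece model (the paths and options below are just placeholders, not my exact settings):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.txt",        # raw training text, one sentence per line
    model_prefix="spm_bpe",   # writes spm_bpe.model and spm_bpe.vocab
    vocab_size=32000,
    model_type="bpe",
)
```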
When I sample the data that goes into Transformer-based NMT training in OpenNMT-py (using n_sample, following the Translation — OpenNMT-py documentation), I can see the samples in the format below:
▁The ▁decision ▁in ▁R ▁v ▁Sp en cer 287 ▁( 20 1 4 ) ▁was ▁related ▁to ▁inform ational ▁privacy .
▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .
But when I externally encode the same sentences using the same SentencePiece model, I get them in the format below (i.e. a list of subword string tokens):
['▁The', '▁decision', '▁in', '▁R', '▁v', '▁Sp', 'en', 'cer', '287', '▁(', '20', '1', '4', ')', '▁was', '▁related', '▁to', '▁inform', 'ational', '▁privacy', '.']
['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter', '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']
Although the tokenisation is the same and only the format differs (a list vs. a space-joined string), this is causing issues at translation time.
So once the model is trained, what I do is encode my test sentences with the same sentencepiece package (calling encode_as_pieces on the processor), which again gives me a list of subword tokens, and I join them with spaces to match the format of the OpenNMT training samples (▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .), roughly as in the sketch below.
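In code, the inference-side preprocessing I do looks roughly like this (the model path and sentence are just placeholders):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm_bpe.model")

sent = "The further details of the chapter are not necessary for our purpose."
pieces = sp.encode_as_pieces(sent)   # ['▁The', '▁further', ...]
line = " ".join(pieces)              # '▁The ▁further ...' to match the n_sample format
```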
The translated output is again in this format (▁আর ভি ▁রবীন্দ্র ন ▁ও ▁কেএস ▁রাধাকৃষ্ণান).
To detokenise, I again have to split it on spaces into a list of subword tokens and call the SentencePiece DecodePieces function.
But what happens is that stray ▁ characters are left in the decoded sentence (আরভি রবীন্দ্রন ও কেএস▁রাধাকৃষ্ণান).
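The detokenisation side looks roughly like this (the translated line is just the example from above):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm_bpe.model")

translated = "▁আর ভি ▁রবীন্দ্র ন ▁ও ▁কেএস ▁রাধাকৃষ্ণান"  # raw output line from the model
out_pieces = translated.split(" ")                      # back to a list of subword tokens
detok = sp.decode_pieces(out_pieces)                    # expected plain text, but stray ▁ remain
```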
Also, I am not sure whether this is the right way or whether I am missing something, as a lot of SentencePiece-specific pre- and post-processing has to be done to arrive at the right format; it seemed much simpler in the OpenNMT quickstart example.
I also wanted to add that the ctranslate2 translate_batch function likewise accepts "a list of list of string",
so the same manipulations have to be made there as well to make it work (see the sketch below). Am I doing something wrong, or is this expected?
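For completeness, the CTranslate2 path looks roughly like this (the model directory is a placeholder, and the way results are accessed may differ between CTranslate2 versions):

```python
import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm_bpe.model")

translator = ctranslate2.Translator("ct2_model_dir")   # converted CTranslate2 model
pieces = sp.encode_as_pieces("The decision was related to informational privacy.")
results = translator.translate_batch([pieces])         # expects a list of list of string tokens
out_pieces = results[0].hypotheses[0]                  # tokens of the best hypothesis
detok = sp.decode_pieces(out_pieces)
```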