hey @francoishernandez
I was looking at the sample generated before training begins, and I noticed one difference.
After tokenisation, I get sentences in the format below:
▁Per usal ▁of ▁the ▁file ▁shows ▁that ▁a ▁counter ▁affidavit ▁was ▁filed ▁by ▁respondent ▁No . ▁ 2 ▁under ▁index ▁dated ▁1 7 . ▁ 05 . ▁ 2000 , ▁wherein ▁the ▁impugned ▁order ▁dated ▁ 27 . ▁1 0 . ▁1 9 9 9 ▁was ▁sought ▁to ▁be ▁supported .
▁The ▁decision ▁in ▁R ▁v ▁Sp en cer 287 ▁( 20 1 4 ) ▁was ▁related ▁to ▁inform ational ▁privacy .
▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .
However, for the same corpus and SentencePiece model, if I use sp.encode or sp.encode_as_pieces, I get the tokenised text in the format below:
['▁Per', 'usal', '▁of', '▁the', '▁file', '▁shows', '▁that', '▁a', '▁counter', '▁affidavit', '▁was', '▁filed', '▁by', '▁respondent', '▁No', '.', '▁', '2', '▁under', '▁index', '▁dated', '▁1', '7', '.', '▁', '05', '.', '▁', '2000', ',', '▁wherein', '▁the', '▁impugned', '▁order', '▁dated', '▁', '27', '.', '▁1', '0', '.', '▁1', '9', '9', '9', '▁was', '▁sought', '▁to', '▁be', '▁supported', '.']
['▁The', '▁decision', '▁in', '▁R', '▁v', '▁Sp', 'en', 'cer', '287', '▁(', '20', '1', '4', ')', '▁was', '▁related', '▁to', '▁inform', 'ational', '▁privacy', '.']
['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter', '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']
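As a quick sanity check (plain Python, no SentencePiece model needed), joining the pieces of the third sentence with single spaces reproduces the first format exactly, so as far as I can tell the two outputs contain the same tokens and differ only in how they are printed:

```python
# Pieces as returned by sp.encode_as_pieces for the third sentence above.
pieces = ['▁The', '▁further', '▁details', '▁of', '▁the', '▁chapter',
          '▁are', '▁not', '▁necessary', '▁for', '▁our', '▁purpose', '.']

# Joining with spaces gives the string shown in the training sample log.
joined = ' '.join(pieces)
print(joined)
# → ▁The ▁further ▁details ▁of ▁the ▁chapter ▁are ▁not ▁necessary ▁for ▁our ▁purpose .
```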
What is the reason for this difference? I am referring to the SentencePiece Python wrapper documentation (sentencepiece/README.md at master · google/sentencepiece · GitHub).
Previously I have been using the Python package of SentencePiece and calling its encode function, which gave me tokenised output in the list format shown above.