Hi
I understand this question may not belong here, but maybe someone has solved it before. Excuse me if this is not the correct place.
I have created a SentencePiece model from an untokenized corpus:
spm_train --input=../corpus.txt --model_prefix=spces --vocab_size=50000 --character_coverage=1.0 --model_type=unigram
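For reference, the equivalent call through the Python bindings would look roughly like this (a minimal sketch; I am just passing the same flags as above, and the paths are my local ones):

import sentencepiece as spm

# Train with the same settings as the spm_train command above.
spm.SentencePieceTrainer.Train(
    '--input=../corpus.txt --model_prefix=spces '
    '--vocab_size=50000 --character_coverage=1.0 --model_type=unigram'
)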
I have run OpenNMT-py, and translate.py is returning many sentences like this:
▁Aportació n ▁de ▁la ▁corporació n ▁local : ▁213 . 566 ▁pta s .
When I try to decode the sentence with spm_decode, the result is:
▁Aportación de la▁corporación local: 213.566 ptas.
I am not sure why the whitespace marker (▁) is not being removed.
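To double-check outside the shell, this is roughly how I reproduce the decode step with the Python bindings (a minimal sketch; the piece list is copied from the translate.py output above):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spces.model')

# Pieces exactly as produced by translate.py.
pieces = '▁Aportació n ▁de ▁la ▁corporació n ▁local : ▁213 . 566 ▁pta s .'.split()
print(sp.DecodePieces(pieces))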
Just to try to narrow down where the problem is, I have run these commands (notice I have added a space inside Aportación):
echo "▁A portació n ▁de ▁la ▁corporació n" | spm_decode --model=spces.model
returns
Aportación de la▁corporación
Or, adding multiple spaces:
echo "▁A portació n ▁de ▁la ▁c o r p o r a c i ó n" | spm_decode --model=spces.model
returns
Aportación de la corporación
So again, I am not sure what is wrong here. The SentencePiece README states:
Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.
detokenized = ''.join(pieces).replace('▁', ' ')
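Applying that rule by hand (a minimal sketch; '\u2581' is the ▁ marker used by SentencePiece) does give me the detokenization I expect:

pieces = '▁Aportació n ▁de ▁la ▁corporació n ▁local : ▁213 . 566 ▁pta s .'.split()
detokenized = ''.join(pieces).replace('\u2581', ' ').strip()
print(detokenized)  # Aportación de la corporación local: 213.566 ptas.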
But it looks like this simple rule is not being followed here. Has anyone faced this before?
Thanks in advance
Have a nice day!
Miguel