SentencePiece decoding

Hi

I understand this question should not be placed here, but maybe someone has solved it before. Excuse me if this is not the correct place.

I have created a SentencePiece model based on an untokenized corpus.

spm_train --input=../corpus.txt --model_prefix=spces --vocab_size=50000 --character_coverage=1.0 --model_type=unigram

I have run OpenNMT-py, and translate.py is returning many sentences like this:

 ▁Aportació n ▁de ▁la ▁corporació n ▁local : ▁213 . 566 ▁pta s .

When I try to decode the sentence (spm_decode) the result is:

▁Aportación de la▁corporación local: 213.566 ptas.

Not sure why the blank marker is not removed.

Just to try to find out where the problem is, I have run these commands (notice I have added a blank in Aportación):

echo "▁A portació n ▁de ▁la ▁corporació n" | spm_decode --model=spces.model

returns

 Aportación de la▁corporación

Or, adding multiple spaces:

echo "▁A portació n ▁de ▁la ▁c o r p o r a c i ó n" | spm_decode --model=spces.model

returns

Aportación de la corporación

So again, I am not sure what is wrong here. The SentencePiece readme states:

Since the whitespace is preserved in the segmented text, we can detokenize the text without any ambiguities.

  detokenized = ''.join(pieces).replace('▁', ' ')
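
Indeed, applying that rule by hand in Python to the pieces above (a quick sketch, just splitting the translate.py output line on spaces) gives exactly the detokenization I would expect:

pieces = "▁Aportació n ▁de ▁la ▁corporació n ▁local : ▁213 . 566 ▁pta s .".split()
print(''.join(pieces).replace('▁', ' ').lstrip(' '))
# prints: Aportación de la corporación local: 213.566 ptas.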

But it looks like this simple rule is not followed here. Has anyone faced this before?

Thanks in advance
Have a nice day!

Miguel

Hi Miguel,

I’ve experienced the same kind of problem. Although I cannot tell you the source of your problem, I can tell you what works great for me.

I use a tokenizer (Moses) before applying SentencePiece because it properly separates tokens like “ptas.”, which is in fact “ptas” and “.”.

So instead of having

▁pta s .

you end up with

▁pta s ▁.

This way, deleting all spaces and then replacing ▁ with spaces will give you back a consistent corpus. The downside is that you will end up with a space between the last word and the final period, like:

ptas .
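
For example, the pre-tokenization step could look roughly like this with the sacremoses and sentencepiece Python packages (just a sketch; the language code and the model file name are placeholders, not my exact setup):

from sacremoses import MosesTokenizer
import sentencepiece as spm

mt = MosesTokenizer(lang='es')            # Moses tokenization first
sp = spm.SentencePieceProcessor()
sp.Load('spces.model')                    # placeholder model name

raw = "Aportación de la corporación local: 213.566 ptas."
tok = mt.tokenize(raw, return_str=True)   # roughly "Aportación de la corporación local : 213.566 ptas ."
print(' '.join(sp.EncodeAsPieces(tok)))   # the final "." is now its own token, e.g. "... ▁pta s ▁."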

I’ve run identical NMT models (transformer_big), first using only SentencePiece and then using tokenization + SentencePiece; the BLEU results are significantly in favor of the latter.

(assuming you’re working on translation too)


Hi Miguel,
The SentencePiece documentation states:
“--model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using the word type.”
I used --model_type=bpe without pre-tokenization and I don’t get your problem when using the Docker implementation of the OpenNMT-tf server.


Hi Terence, thanks a lot for your comment.

Sorry if I am missing something here. I also read this:

--model_type: model type. Choose from unigram (default), bpe, char, or word. The input sentence must be pretokenized when using the word type.

My interpretation is that if you use --model_type=word, your input has to be pretokenized. I did not use the “word” type (but “unigram”), so I assume (maybe too optimistically) that my model is fine.

In order to solve my particular problem, I created a Python program that removes the blanks, replaces the blank marker, and left-trims the sentence to detokenize it:

import sys

# Detokenize SentencePiece output: remove the spaces between pieces, turn the
# word-boundary marker (▁) back into a space, and strip the leading space left
# by the first marker. Reads the file given as the first argument and writes
# the result to stdout.
fn = sys.argv[1]
with open(fn) as f:
    for line in f:
        line = line.replace(' ', '').replace('▁', ' ').lstrip(' ')
        sys.stdout.write(line)
sys.stdout.flush()

Hi Valentin,

Thanks a lot for your comment. Your post is quite interesting. I thought that one of the selling points of SentencePiece was its ability to ingest a raw corpus, but according to your post, tokenizing + SentencePiece gives better results than SentencePiece alone. In fact, besides the blank marker, SentencePiece-detokenized files are very similar to Perl Moses-detokenized files (at least for Spanish).

If you don’t mind, which model type do you use for SentencePiece? (char/word/bpe/unigram)

Have a nice day!
Miguel

Hi Miguel,

I also thought that SP was made to ingest raw corpus, and it does, but I got better results with tokenization :slight_smile:

I tried both unigram and bpe, and I got significantly better results with bpe (still on my EN-DE translation task, measured by BLEU score).

Also, SentencePiece bpe gave me better results than Sennrich’s original implementation of BPE (still using the exact same model configurations to compare).


Hi Valentin, I’m currently reviewing a (real-world) translation project and comparing a translation produced by a Transformer model built with SentencePiece bpe against one produced by a Lua OpenNMT model (without BPE)… The SentencePiece bpe trained model noticeably does a much better job of reducing UNKs and handling named entities.

Hi Terence,

It seems that we share similar results when using SentencePiece bpe. I’m curious to understand why this tool outperforms other approaches.

Hi Valentin,
Do we also have to decode the final translated sentences with SentencePiece before calculating the BLEU score? And if I have used pre-tokenization, do I need to first detokenize, then decode with SentencePiece, and then calculate the BLEU score?
Also, I am using the Python wrapper to encode with SentencePiece. My training corpus after that looks something like this:
[‘▁अपने’, ‘▁अनुप्रयोग’, ‘▁को’, ‘▁पहुंच’, ‘नीय’, ‘ता’, ‘▁व्या’, ‘याम’, ‘▁का’, ‘▁लाभ’, ‘▁दें’]
[‘▁एक्’, ‘से’, ‘र्’, ‘साइ’, ‘सर’, ‘▁पहुंच’, ‘नीय’, ‘ता’, ‘▁अन्वे’, ‘षक’]

Is this correct? Is this how I feed it into my Transformer model? And do we need to train the SentencePiece model on the concatenated parallel corpus, or individually for source and target?
I am translating a Hindi-to-English parallel corpus.

Hi ajitesh,

Yes, you need to decode your translated sentences. However, I do not detokenize my corpus before calculating BLEU; see Jean Senellart’s answer here: Detokenization clarification

Yes, your training corpus seems correct: a ‘▁’ before a piece means it starts a new word, while no marker means it is a subpart of a word.

I don’t know if you can feed your model like that, but for sure you can ' '.join([‘▁एक्’, ‘से’, ...]) to put your text in a file and feed it directly into your model.
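
For instance, with the Python wrapper it would look roughly like this (a sketch; the model file name is just a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('your_spm.model')       # placeholder model name

pieces = ['▁अपने', '▁अनुप्रयोग', '▁को', '▁पहुंच', 'नीय', 'ता', '▁व्या', 'याम', '▁का', '▁लाभ', '▁दें']
print(' '.join(pieces))         # the space-separated line you write to the file fed to the model
print(sp.DecodePieces(pieces))  # back to plain text, e.g. before computing BLEU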

Also, to remove the SentencePiece segmentation from your output file you could use something like:
cat your_file | perl -pe 's/ //g' | sed "s/▁/ /g" | sed "s/^ //g" > output

I trained my SentencePiece models on source and target individually

Thank you Valentin, I will try this. I am also trying Subword-NMT (BPE).

No problem. Could you tell us if you get better results with SentencePiece than with Subword-NMT? That is my case, and it would be nice to have other results.

Sure, I will post my results once my training is finished

Hello! As you’re discussing the comparison between SentencePiece BPE and Subword-NMT BPE, I’d like to share the results of my little experiment with these tools.

I trained two identical Transformer base models for the EN-RU pair on corpora of 6 million sentences, preprocessed with SentencePiece and with Subword-NMT respectively. For Subword-NMT’s learn_bpe.py script I set 60000 merge operations, which resulted in a vocabulary size of ~63000. For SentencePiece’s vocab_size parameter I set 63000.
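
(For reference, the SentencePiece side of that setup can be reproduced from Python with something like the sketch below; the input file and model prefix are placeholders.)

import sentencepiece as spm

# BPE model with a vocabulary size matching the ~63000 obtained with Subword-NMT
spm.SentencePieceTrainer.Train(
    '--input=train.ru --model_prefix=ru_bpe --vocab_size=63000 --model_type=bpe'
)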

I tested the models on newstest2017 and got the following results.

Tool            BLEU (1st epoch)    BLEU (2nd epoch)    Validation PPL
Subword-NMT     24.8                28.6                6.69
SentencePiece   23.5                26.3                7.25

BLEU was evaluated with the SacreBLEU tool on detokenized sentences with the “case-insensitive” flag.
As I have very limited time available on a GPU, I trained both models for only 2 epochs, so I’m not sure how representative these results are, but I chose the Subword-NMT model for further training.

Hi,
Interesting. Did you use a shared vocab for both EN and RU?
What model type did you use for SentencePiece?

Hi, @vince62s! No, I didn’t share the vocab. For SentencePiece I used BPE. I forgot to mention that I ran spm_train on raw sentences, while with Subword-NMT I used the Moses tokenizer.

So this is the main reason for your difference in BLEU.


Do you mean feeding SentencePiece raw sentences? I did that because its readme says it’s the proper way to work with the tool in BPE mode. Unfortunately, I read @valentinmace’s suggestion about pretokenizing sentences too late, after training. It would be interesting to try that.

For anyone still interested in this: tokenization before using SentencePiece does in fact improve performance, as documented in the experiments here: https://github.com/google/sentencepiece/blob/master/doc/experiments.md


Hi Miguel, it may be more than two years later, but I came across your solution (to a SentencePiece decode problem I only had on Windows, not on Linux) and it did the job :slight_smile: Terence