SentencePiece + pyonmttok

Hello,

I’m trying to use SentencePiece while still benefiting from the features of the OpenNMT tokenizer (pyonmttok), such as case_markup.

Can I simply start by using pyonmttok.Tokenizer with whatever options I want, and then apply SentencePiece to the resulting file? Then, after inference, I would remove all the “▁” markers added by SentencePiece and untokenize the results to restore the capital letters and whatever tokenization was applied.
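
To make it concrete, here is an untested sketch of the two-step plan (I’m assuming the pyonmttok and sentencepiece Python APIs here; file and model names are placeholders):

import pyonmttok
import sentencepiece as spm

# Step 1: OpenNMT tokenization with case markup
tokenizer = pyonmttok.Tokenizer("aggressive", case_markup=True, joiner_annotate=True)
tokens, _ = tokenizer.tokenize("Hello WORLD")

# Step 2: apply a trained SentencePiece model to the pre-tokenized text
sp = spm.SentencePieceProcessor(model_file="sp.model")
pieces = sp.encode(" ".join(tokens), out_type=str)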

Please feel free to let me know if my understanding is correct or if there is any better way to do this.

My question applies to both OpenNMT-py and OpenNMT-tf.

After some thought, I figured out that my plan mentioned above couldn’t work. So my second alternative will be to create 4 custom tags in SentencePiece.

Custom tags:
1) <UA> (upper all) = the whole word is uppercased
2) <UF2> (upper first 2 letters) = the first 2 letters are uppercased
3) <UF1> (upper first letter) = only the first letter is uppercased
4) <US> (upper some letter(s)) = "else": an uppercase letter is present but the word doesn't fit the first 3 categories

These tokens will be placed before the words that contain uppercase letter(s) before calling SentencePiece, and the text will then be lowercased by SentencePiece. During untokenization, I will be able to restore cases 1, 2 and 3 through an algorithm, and case 4 through a specific mapping list (rare cases).
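
For illustration, the classification rule could look like this (a minimal untested sketch; the function name is my own):

def case_tag(token):
    if token.isupper():
        return '<UA>'   # the whole word is uppercased
    if token[:2].isupper():
        return '<UF2>'  # the first 2 letters are uppercased
    if token[0].isupper() and token[1:].islower():
        return '<UF1>'  # only the first letter is uppercased
    if not token.islower():
        return '<US>'   # some other uppercase pattern
    return None         # no uppercase, no tag needed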

I wish I could have created dynamic tags… I would just have needed one that specifies the positions of the uppercase chars.

I think it is a nice approach. You can almost get rid of the annoying case issue. Other solutions usually involve creating a painful case model based on a monolingual corpus. With your approach, all your bilingual corpus can be handled in lowercase, the neural network can easily learn casing based on your markup, and you can tell SentencePiece to protect your tags.

I have tried a similar approach, but with just 2 tags: ((up)) for first letter uppercased and ((aup)) for all uppercase. I like your UFn idea, but it makes this a little bit more complicated, and I’m not sure it pays off. It is a simple and effective way as far as I have seen.
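
For example, with this scheme “Hello WORLD” would simply become “((up)) hello ((aup)) world”.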

Hope this helps. Have a nice day!

Thanks for confirming this has been tried before!

I understand why you would use 2 tags. In my case, for some languages I’m dealing with, I often have words with 1 uppercase letter in the middle of the word. So I’m still debating if I should create 1 tag for when the first letter is uppercased and 1 that specifies the position of the other uppercase letters in the word.

Ex1:
THisistheword
<U1> <U2> thisistheword

Ex2:
ThIsistheword
<U1> <U3> thisistheword

Ex3:
ThiSistheword
<U1> <U4> thisistheword

Ex4:
thiSistheword
<U4> thisistheword

Ex5:
Thisistheword
<U1> thisistheword

Not sure what to say. If the casing is related to the meaning of subwords, I would think some sort of tokenization should also be considered. Someone with German translation experience could probably help you, as they have compound nouns that resemble your case.

So in the end I did as I mentioned above, and the results seem perfect so far. For some of my African languages it’s a life saver.

Things to consider:

1) A capitalized word that is 1 character long will be flagged with <ua>, not <u1>.
2) You need to add 21 custom tokens to SentencePiece (see the training sketch after this list):
--user_defined_symbols=<ua>,<u1>,<u2>,<u3>,<u4>,<u5>,<u6>,<u7>,<u8>,<u9>,<u10>,<u11>,<u12>,<u13>,<u14>,<u15>,<u16>,<u17>,<u18>,<u19>,<u20>
Feel free to add more if you believe you could have a word longer than 20 chars with partial capitalization.
3) I don't have weird capitalization in my dataset, such as "EXAMPLE's" kept as one token, which would generate "<u1> <u2> <u3> <u4> <u5> <u6> <u7> example's". In that case, you are better off tokenizing "EXAMPLE" and "'s" as 2 tokens.
4) "P.H.D" will become "<ua> p.h.d" if "P.H.D" was kept as 1 token.

If anyone needs it, here is the Python code:

import regex


def caseFeatureFunction(text, action):

    caseFeatureTokens = []

    if action == 'tokenize':
        tokens = text.split()

        for token in tokens:

            # if there is at least 1 capital letter
            if not token.islower():
                # if the whole token is capitalized
                if token.isupper():
                    # add the "all uppercase" tag
                    caseFeatureTokens.append('<ua>')
                else:
                    # identify the positions of the capital letters
                    capitalLettersPosition = [i for i, c in enumerate(token) if c.isupper()]

                    # add one positional tag per capital letter (1-based for readability)
                    for position in capitalLettersPosition:
                        caseFeatureTokens.append('<u' + str(position + 1) + '>')

            # add the word lowercased
            caseFeatureTokens.append(token.lower())
    elif action == 'untokenize':
        # attach the <ua> tags to the following word by removing the space after them
        text = regex.sub(r'(<ua>) ', r'\1', text)

        # attach the positional <ui> tags to the following word
        # ([1-9][0-9]? matches 1 to 99, including <u10> and <u20>)
        text = regex.sub(r'(<u[1-9][0-9]?>) ', r'\1', text)

        # transform into tokens
        tokens = text.split()

        for token in tokens:

            # if there is an "all uppercase" tag
            if '<ua>' in token:
                # remove the tag and uppercase the whole word
                token = token.replace('<ua>', '').upper()
            elif '<u' in token:
                # find the positional tags
                tags = regex.findall('<u[1-9][0-9]?>', token)

                # remove the tags from the token
                for tag in tags:
                    token = token.replace(tag, '')

                # capitalize the position specified by each tag
                for tag in tags:
                    # keep only the number and go back to a 0-based index
                    position = int(tag.replace('<u', '').replace('>', '')) - 1
                    # uppercase the character at that position
                    if position < len(token):
                        token = token[:position] + token[position].upper() + token[position + 1:]

            # add the restored token
            caseFeatureTokens.append(token)
    return ' '.join(caseFeatureTokens)
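
And a quick round trip to show the expected behavior (output as I expect it from the code above):

tagged = caseFeatureFunction('Hello WORLD thiSword', 'tokenize')
print(tagged)
# <u1> hello <ua> world <u4> thisword

print(caseFeatureFunction(tagged, 'untokenize'))
# Hello WORLD thiSword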