So in the end I did as I mentioned above, and the results seem perfect so far. For some of my African languages it’s a lifesaver.
Things to consider:
1) A capitalized word that is only 1 character long will be flagged with <ua> and not <u1>.
2) You need to add 21 custom tokens in SentencePiece:
--user_defined_symbols=<ua>,<u1>,<u2>,<u3>,<u4>,<u5>,<u6>,<u7>,<u8>,<u9>,<u10>,<u11>,<u12>,<u13>,<u14>,<u15>,<u16>,<u17>,<u18>,<u19>,<u20>
Feel free to add more if you think you could have a partially capitalized word with a capital letter beyond the 20th character.
3) I don't have weird capitalization in my dataset such as "EXAMPLE's" as one token, which would generate <u1> <u2> <u3> <u4> <u5> <u6> <u7> example's. In that case, you'd better tokenize EXAMPLE and 's as 2 tokens.
4) "P.H.D" will become "<ua> p.h.d" if P.H.D was 1 token.
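For point 2, typing the symbol list by hand is error-prone, so here is a small sketch that builds the same flag programmatically. `MAX_POSITION` is a name I made up for illustration; raise it if your data can have capitals further into a word.

```python
# Assumption: no partially capitalized word has a capital letter
# past this 1-based position (20 matches the flag shown above).
MAX_POSITION = 20

# <ua> marks a fully capitalized word, <u1>..<u20> mark single positions.
symbols = ["<ua>"] + [f"<u{i}>" for i in range(1, MAX_POSITION + 1)]

# Same comma-separated form as the --user_defined_symbols flag above.
flag = "--user_defined_symbols=" + ",".join(symbols)

print(len(symbols))
print(flag)
```

The printed flag can be pasted into the training command as is.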
If anyone needs it, here is the Python code:
import re  # standard library; the third-party regex module works the same way here

def caseFeatureFunction(text, action):
    # output tokens, joined with single spaces at the end
    caseFeatureTokens = []
    if action == 'tokenize':
        tokens = text.split()
        for token in tokens:
            # if the token is not entirely lowercase
            if not token.islower():
                # if all letters are capitalized
                if token.isupper():
                    # add tag for a fully capitalized word
                    caseFeatureTokens.append('<ua>')
                else:
                    # identify the positions of the capital letters
                    capitalLettersPosition = [i for i, c in enumerate(token) if c.isupper()]
                    # add one tag per capital letter (1-based for readability)
                    for position in capitalLettersPosition:
                        caseFeatureTokens.append('<u' + str(position + 1) + '>')
            # add word lowercased
            caseFeatureTokens.append(token.lower())
    elif action == 'untokenize':
        # remove the space after capital tags <ua>
        text = re.sub(r'(<ua>) ', r'\1', text)
        # remove the space after capital tags <ui>; [0-9] so that <u10> and <u20> match too
        text = re.sub(r'(<u[0-9]{1,2}>) ', r'\1', text)
        # transform into tokens
        tokens = text.split()
        for token in tokens:
            # if there is an all-capital tag
            if '<ua>' in token:
                # remove tag
                token = token.replace('<ua>', '')
                # capitalize the whole word (str.upper() returns a new string)
                token = token.upper()
            elif '<u' in token:
                # find the position tags
                tags = re.findall(r'<u[0-9]{1,2}>', token)
                # remove tags from token
                for tag in tags:
                    token = token.replace(tag, '')
                # capitalize the positions specified in the tags
                for tag in tags:
                    # keep only the number, and undo the + 1 done at creation
                    position = int(tag.replace('<u', '').replace('>', '')) - 1
                    # skip positions outside the word instead of raising IndexError
                    if 0 <= position < len(token):
                        token = token[:position] + token[position].upper() + token[position + 1:]
            # add word with its case restored
            caseFeatureTokens.append(token)
    return ' '.join(caseFeatureTokens)
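To sanity-check the scheme, here is a compact, self-contained sketch of the same idea with a round trip on the cases listed above. `encode_case` and `decode_case` are names I made up for this example. The decoder is a slightly different approach from the function above: it collects pending tags and applies them to the next word, rather than gluing tags onto the word with a regex first. Treat it as an illustration under those assumptions, not a drop-in replacement.

```python
import re

# Matches a whole tag token: <ua>, or <u1> .. <u99>.
TAG = re.compile(r"<u(a|[0-9]{1,2})>")

def encode_case(text):
    """Lowercase each whitespace-separated token, prefixed by case tags."""
    out = []
    for token in text.split():
        if token.isupper():
            # every cased character is uppercase, e.g. "NASA", "A", "P.H.D"
            out.append("<ua>")
        else:
            # one tag per capital letter, 1-based position
            out.extend(f"<u{i + 1}>" for i, c in enumerate(token) if c.isupper())
        out.append(token.lower())
    return " ".join(out)

def decode_case(text):
    """Apply the tags that precede a word to that word, then drop the tags."""
    out, pending = [], []
    for token in text.split():
        match = TAG.fullmatch(token)
        if match:
            pending.append(match.group(1))
            continue
        if "a" in pending:
            token = token.upper()
        else:
            chars = list(token)
            for p in pending:
                i = int(p) - 1
                # ignore positions that fall outside the word
                if 0 <= i < len(chars):
                    chars[i] = chars[i].upper()
            token = "".join(chars)
        out.append(token)
        pending = []
    return " ".join(out)

sentence = "I bought an iPhone at NASA HQ in McDonald"
encoded = encode_case(sentence)
print(encoded)
print(decode_case(encoded) == sentence)
```

Note that the round trip is only exact for text whose uppercase and lowercase forms map one-to-one, and whitespace is normalized to single spaces.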