I was trying to translate a short sentence from English to Hindi.
1. The initial sentence was: ‘columbus is in ohio.’
2. After subwording it became: ‘▁columb us ▁is ▁in ▁o hi o .’
3. The translation I got after passing it through my model was: ‘▁कोलंब स ▁ओ ▁ही में ▁है ▁।’
4. Which, when desubworded, turns into this text: ‘कोलंब स ओ ही में है ।’
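For context, this is the desubwording convention I assume the ‘▁’ marker implies (SentencePiece-style, where ‘▁’ marks a word boundary and ordinary spaces only separate subword pieces); a minimal sketch:

```python
# Sketch of '▁'-based desubwording, assuming SentencePiece-style markers:
# concatenate the pieces, then turn each '▁' (U+2581) into a real space.
def desubword(tokens):
    return "".join(tokens).replace("\u2581", " ").strip()

pieces = "▁columb us ▁is ▁in ▁o hi o .".split()
print(desubword(pieces))  # → columbus is in ohio.
```

On the source side this round-trips exactly, because pieces without a leading ‘▁’ simply attach to the word before them.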
Here, the subword models were built with 32K BPE merge operations. The translation in (4) is quite close, but some stray space characters are introducing noise into it.
For example, in the translation in (4), if कोलंब and स are combined, they form कोलंबस, which is the exact translation of the word ‘columbus’. Similarly, ▁ओ and ▁ही, when combined, form ओही, which is not the exact transliteration of ‘ohio’ but is very close (an acceptable translation).
Thus my question is: how exactly can the desubwording step be improved? I initially considered a plain programmatic approach (finding patterns), but the two cases above clearly require different patterns, and such an approach would probably end up merging words that are not meant to be merged. Can anything be done about this, given that the translation (apart from the unwanted tokenization) is very close to the gold sentence? I have noticed similar cases in other sentences where I previously got an unk but now get a tokenized version like this instead.
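One option, assuming the model emits the ‘▁’ marker reliably at word starts, is to let the marker itself decide which neighbouring pieces to merge, instead of searching for textual patterns; a hedged sketch:

```python
# Marker-aware desubwording sketch: merge a piece into the previous
# word if and only if it lacks the '▁' (U+2581) word-start marker.
def desubword(tokens):
    words = []
    for tok in tokens:
        if tok.startswith("\u2581") or not words:
            words.append(tok.lstrip("\u2581"))
        else:
            words[-1] += tok  # continuation piece: glue onto previous word
    return " ".join(words)

hyp = "▁कोलंब स ▁ओ ▁ही में ▁है ▁।".split()
print(desubword(hyp))  # → कोलंबस ओ हीमें है ।
```

On the model output above this merges कोलंब and स into कोलंबस, while ▁ओ and ▁ही stay separate because the model itself marked both as word-initial; that remaining split would therefore seem to come from the model rather than from the desubwording step.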