I was trying to translate a short sentence from English to Hindi.
1. The initial sentence was: ‘columbus is in ohio.’
2. After subwording it became: ‘▁columb us ▁is ▁in ▁o hi o .’
3. The translation I got after passing it through my model was: ‘▁कोलंब स ▁ओ ▁ही में ▁है ▁।’
4. Desubworded, this turns into: ‘कोलंब स ओ ही में है ।’
Here, the subwording models were created using 32K BPE. The translation labelled (4) is quite close, but there are some stray space characters which add noise to the translation.
For example, in the translation in (4), if कोलंब and स were combined, that would form कोलंबस, which is the exact translation of the word columbus. Similarly, ▁ओ and ▁ही when combined form ओही, which is not the exact transliteration of ohio but is very close (an acceptable translation).
So my question is how exactly the desubwording part can be improved. I initially thought of a plain programmatic approach (pattern matching), but the two cases above clearly follow different patterns, and such an approach would probably end up merging words that are not meant to be merged. Can something be done about this, since the translation (apart from the unwanted tokenization) is very close to the gold sentence? I have noticed similar cases in other sentences where I was previously getting an <unk>, and now I get this kind of tokenized output instead.
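For reference, my understanding is that the standard SentencePiece desubwording rule is simply to concatenate the pieces and replace the ▁ marker with a space. A minimal sketch with the tokens from (3) above; it merges कोलंबस correctly, but still leaves ओ and ही apart, since both carry a ▁ marker:

```python
# Minimal sketch of the usual SentencePiece detokenization rule:
# concatenate the pieces, then turn the "▁" word-boundary marker into a space.
pieces = ["▁कोलंब", "स", "▁ओ", "▁ही", "में", "▁है", "▁।"]

text = "".join(pieces).replace("▁", " ").strip()
print(text)  # -> कोलंबस ओ हीमें है ।
```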
I do not think the problem is in subwording/desubwording. The main issue here is that the sentence is too difficult for the NMT model.
I am just wondering why “।” is preceded by a space. I guess it should not be, right? Did you use the Moses Tokenizer for tokenization? If so, please do not use the Moses Tokenizer for Hindi; either use SentencePiece directly, or use one of the Indic NLP tools.
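For example, something along these lines (a rough sketch; the file names and settings are placeholders) trains SentencePiece directly on the raw, untokenized Hindi text and applies it:

```python
import sentencepiece as spm

# Rough sketch: train a 32K BPE model directly on raw (untokenized) Hindi text.
# File names and options here are only placeholders.
spm.SentencePieceTrainer.train(
    input="train.hi",           # raw text, one sentence per line
    model_prefix="hi_bpe32k",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,     # keep full coverage of the Devanagari script
)

sp = spm.SentencePieceProcessor(model_file="hi_bpe32k.model")
pieces = sp.encode("कोलंबस ओहियो में है ।", out_type=str)
print(pieces)
print(sp.decode(pieces))        # decode() restores the original spacing
```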
You did not mention which tool you used for subwording. If it is SentencePiece, I believe it is better to post your questions on their GitHub issues to get more informative responses.
It was quite an obvious type of issue, caused by me running two iterations of SentencePiece on the dataset, as was mentioned here.
I had first subworded the training files using the subwording script and then pointed the config file at that subworded output (i.e. the training corpus and validation set were the subworded files, not the raw ones). The config then applied another sentencepiece transform on the already-subworded data. Because of this double tokenization, the output vocabulary had leading space markers on almost all of its entries, and the translations came out the same way.
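Roughly, the effect was like encoding text that had already been split into pieces a second time; a small sketch (the model file name is a placeholder):

```python
import sentencepiece as spm

# Rough illustration of the mistake (model file name is a placeholder):
# passing already-subworded text through SentencePiece a second time.
sp = spm.SentencePieceProcessor(model_file="hi_bpe32k.model")

raw = "कोलंबस ओहियो में है ।"

once = sp.encode(raw, out_type=str)
print("encoded once :", once)

# The on-the-fly sentencepiece transform then re-encoded the already-subworded
# line, so the "▁" markers were treated as ordinary characters and every piece
# picked up an extra word boundary.
twice = sp.encode(" ".join(once), out_type=str)
print("encoded twice:", twice)
```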
I have now updated the training steps so that tokenization is done only once. The results have improved massively, and almost all of the previously unknown words are now being correctly translated or transliterated.
Yes, you either subword the training and development datasets manually or use the on-the-fly transform, not both. Note, though, that the transform does not apply to the test dataset, so you have to subword it manually.
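Something along these lines should be enough to subword the test file before translation and desubword the hypotheses afterwards (a rough sketch; file and model names are placeholders):

```python
import sentencepiece as spm

# Rough sketch: subword the test set manually before translation.
# File and model names are placeholders.
sp = spm.SentencePieceProcessor(model_file="source_bpe32k.model")

with open("test.src", encoding="utf-8") as fin, \
        open("test.src.sp", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")

# After translation, each output line can be desubworded with the target model:
# sp_tgt.decode(hypothesis_line.split())
```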
One of the main drawbacks of byte-level models, though, is that byte sequences are usually much longer than the original text sequences, which results in higher processing cost. Since self-attention in Transformers is quadratic in sequence length, this poses a serious computational challenge when processing longer and longer sequences. Having said that, we do have advancements like Longformer that use sparse attention and other clever techniques to handle very long sequences.
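As a very rough back-of-the-envelope illustration, using the example sentence pair from earlier in this thread (the numbers are only indicative):

```python
# Back-of-the-envelope comparison: byte-level vs. 32K BPE sequence lengths,
# using the example sentence pair from earlier in this thread.
src = "columbus is in ohio."
tgt = "कोलंब स ओ ही में है ।"

n_src_bytes, n_src_bpe = len(src.encode("utf-8")), 8   # ▁columb us ▁is ▁in ▁o hi o .
n_tgt_bytes, n_tgt_bpe = len(tgt.encode("utf-8")), 7   # ▁कोलंब स ▁ओ ▁ही में ▁है ▁।

print(f"source: {n_src_bytes} bytes vs {n_src_bpe} BPE pieces")
print(f"target: {n_tgt_bytes} bytes vs {n_tgt_bpe} BPE pieces")  # Devanagari chars are 3 bytes each in UTF-8

# Self-attention is quadratic in sequence length, so the relative cost of the
# byte-level target sequence is roughly (n_tgt_bytes / n_tgt_bpe) ** 2.
print(f"relative attention cost (target): ~{(n_tgt_bytes / n_tgt_bpe) ** 2:.0f}x")
```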
I have only experimented a little, but I am optimistic about it as a future strategy. I think you would need more powerful models to keep sentences from becoming too many tokens long, so currently it probably only makes sense for exceptionally difficult tokenization situations.