Single character tokenization?

Is single character tokenization recommended?

This type of issue would have been easier to avoid without a tokenizer:

Hi PJ!

You might want to try BPE Dropout or Subword Sampling. I tried the former on a low-resource model, but the results were not better than with regular BPE. However, I am not sure I used the best parameters.

If you try it, I would be interested to know the results.

Note that they are supported by SentencePiece and OpenNMT-py.
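In case it helps, here is a minimal sketch of what enabling BPE Dropout could look like with the OpenNMT Tokenizer (pyonmttok). The model path and dropout value are placeholders, and the option names are from my reading of the pyonmttok API, so double-check them against the Tokenizer documentation:

```python
import pyonmttok

# Placeholder paths/values for illustration; you need an existing BPE model.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    bpe_model_path="bpe.model",  # path to a trained BPE model (placeholder)
    bpe_dropout=0.1,             # probability of dropping a merge operation
)

# With dropout enabled, repeated calls can segment the same text differently,
# which is the regularization effect BPE Dropout is after.
tokens, _ = tokenizer.tokenize("Hello world!")
print(tokens)
```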



Thanks for flagging this up :slight_smile:

Thanks for the info!

I think Argos Translate currently uses a unigram model with SentencePiece, which, as I understand it, should do something similar to BPE Dropout.
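If so, the sampling side of it can be exercised directly at encode time. A rough sketch with the SentencePiece Python API, using a placeholder model path (I have not checked whether Argos Translate actually enables sampling when preparing training data):

```python
import sentencepiece as spm

# Placeholder path to an existing unigram SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="unigram.model")

# Deterministic (best) segmentation.
print(sp.encode("the quick brown fox", out_type=str))

# Sampled segmentations: nbest_size=-1 samples over all candidates,
# alpha controls the smoothing of the sampling distribution.
for _ in range(3):
    print(sp.encode("the quick brown fox", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```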

I think that one of the main advantages of sending bytes directly to the seq2seq model would be that you wouldn’t need the complexity of a tokenizing system at all.
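To illustrate, the byte-level "tokenizer" really is just a couple of lines of plain Python, with a fixed vocabulary of 256 symbols plus whatever special tokens the model needs:

```python
# Byte-level "tokenization": every string maps to IDs in 0..255,
# with no model or vocabulary file needed.
def to_byte_ids(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def from_byte_ids(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

ids = to_byte_ids("¡Hola!")   # non-ASCII characters expand to multiple bytes
print(ids)                    # [194, 161, 72, 111, 108, 97, 33]
print(from_byte_ids(ids))     # "¡Hola!"
```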

Character tokenization is nice and easy to set up, but it comes with practical challenges related to the increased sequence length. More sentences will be filtered out by length during training, and decoding will be slower (remember that the default Transformer attention is quadratic in the sentence length).
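To make that concrete, a back-of-the-envelope comparison with made-up but typical sentence lengths, counting only the quadratic attention term:

```python
# Rough cost of self-attention, which grows with the square of the sequence length.
subword_len = 25    # a typical sentence in subword tokens (assumed)
char_len = 120      # the same sentence in characters (assumed)

print(subword_len ** 2)                   # 625 attention scores per layer/head
print(char_len ** 2)                      # 14400
print(char_len ** 2 / subword_len ** 2)   # ~23x more work for the character version
```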


Visual explanation of Charformers:


Any idea how to apply the output of Charformer to the OpenNMT Transformer? Or does it have to be a self-built Transformer? Thanks.

As I understand it, the Charformer weights would need to be backpropagated along with the rest of the OpenNMT network.
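As a rough illustration (this is not the actual Charformer GBST code, and the module and parameter names are invented), the idea is that a learned byte-to-block module sits in front of the encoder and is updated by the same backward pass as the rest of the network:

```python
import torch
import torch.nn as nn

class SoftByteTokenizer(nn.Module):
    """Toy stand-in for a GBST-style block: embeds bytes and downsamples them."""
    def __init__(self, dim=512, downsample=4):
        super().__init__()
        self.embed = nn.Embedding(256, dim)
        self.pool = nn.AvgPool1d(kernel_size=downsample, stride=downsample)

    def forward(self, byte_ids):                          # (batch, n_bytes)
        x = self.embed(byte_ids)                          # (batch, n_bytes, dim)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (batch, n_bytes/downsample, dim)
        return x

tokenizer = SoftByteTokenizer()
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

byte_ids = torch.randint(0, 256, (1, 64))          # dummy batch of byte IDs
out = encoder(tokenizer(byte_ids))                 # tokenizer output feeds the encoder
loss = out.sum()
loss.backward()                                    # gradients reach the tokenizer too
print(tokenizer.embed.weight.grad is not None)     # True
```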

Fastformers, or similar architectures, scale linearly with the input length, which makes translating long untokenized character sequences more viable.
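Reusing the made-up lengths from the earlier comparison, a linear-attention model closes most of that gap:

```python
subword_len, char_len = 25, 120          # same assumed lengths as above
print(char_len / subword_len)            # ~4.8x more work with linear attention
print(char_len ** 2 / subword_len ** 2)  # ~23x with quadratic attention
```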