Single character tokenization?

Is single character tokenization recommended?

This type of issue would have been easier to avoid without a tokenizer:

Hi PJ!

You might want to try BPE Dropout or Subword Sampling. I tried the former on a low-resource model, but the results were not better than with regular BPE. However, I am not sure I used the best parameters.

If you try it, I would be interested to know the results.

Note that they are supported by SentencePiece and OpenNMT-py.
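In case it helps, here is a minimal sketch of what enabling BPE Dropout could look like with the OpenNMT Tokenizer (pyonmttok). The model path and dropout value are placeholders, and the option names are from my reading of the pyonmttok API, so double-check them against the Tokenizer documentation:

```python
import pyonmttok

# Placeholder paths/values for illustration; you need an existing BPE model.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    joiner_annotate=True,
    bpe_model_path="bpe.model",  # path to a trained BPE model (placeholder)
    bpe_dropout=0.1,             # probability of dropping a merge operation
)

# With dropout enabled, repeated calls can segment the same text differently,
# which is the regularization effect BPE Dropout is after.
tokens, _ = tokenizer.tokenize("Hello world!")
print(tokens)
```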



Thanks for flagging this up :slight_smile:

Thanks for the info!

I think Argos Translate currently uses a unigram model with SentencePiece, which, as I understand it, should do something similar to BPE Dropout.
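If so, the sampling side of it can be exercised directly at encode time. A rough sketch with the SentencePiece Python API, using a placeholder model path (I have not checked whether Argos Translate actually enables sampling when preparing training data):

```python
import sentencepiece as spm

# Placeholder path to an existing unigram SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="unigram.model")

# Deterministic (best) segmentation.
print(sp.encode("the quick brown fox", out_type=str))

# Sampled segmentations: nbest_size=-1 samples over all candidates,
# alpha controls the smoothing of the sampling distribution.
for _ in range(3):
    print(sp.encode("the quick brown fox", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```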

I think that one of the main advantages of sending bytes directly to the seq2seq model would be that you wouldn’t need the complexity of a tokenizing system at all.
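To illustrate, the byte-level "tokenizer" really is just a couple of lines of plain Python, with a fixed vocabulary of 256 symbols plus whatever special tokens the model needs:

```python
# Byte-level "tokenization": every string maps to IDs in 0..255,
# with no model or vocabulary file needed.
def to_byte_ids(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def from_byte_ids(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

ids = to_byte_ids("¡Hola!")   # non-ASCII characters expand to multiple bytes
print(ids)                    # [194, 161, 72, 111, 108, 97, 33]
print(from_byte_ids(ids))     # "¡Hola!"
```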

Character tokenization is nice and easy to set up, but it comes with practical challenges related to the increased sequence length. More sentences will be filtered out by length during training, and decoding will be slower (remember that the default Transformer attention is quadratic in the sentence length).
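To make that concrete, a back-of-the-envelope comparison with made-up but typical sentence lengths, counting only the quadratic attention term:

```python
# Rough cost of self-attention, which grows with the square of the sequence length.
subword_len = 25    # a typical sentence in subword tokens (assumed)
char_len = 120      # the same sentence in characters (assumed)

print(subword_len ** 2)                   # 625 attention scores per layer/head
print(char_len ** 2)                      # 14400
print(char_len ** 2 / subword_len ** 2)   # ~23x more work for the character version
```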


Visual explanation of Charformers:


Any idea how to apply the output of Charformer to the OpenNMT Transformer? Or does it have to be a self-built Transformer? Thanks.

As I understand it, the Charformer weights would need to be backpropagated along with the rest of the OpenNMT network.
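As a rough illustration (this is not the actual Charformer GBST code, and the module and parameter names are invented), the idea is that a learned byte-to-block module sits in front of the encoder and is updated by the same backward pass as the rest of the network:

```python
import torch
import torch.nn as nn

class SoftByteTokenizer(nn.Module):
    """Toy stand-in for a GBST-style block: embeds bytes and downsamples them."""
    def __init__(self, dim=512, downsample=4):
        super().__init__()
        self.embed = nn.Embedding(256, dim)
        self.pool = nn.AvgPool1d(kernel_size=downsample, stride=downsample)

    def forward(self, byte_ids):                          # (batch, n_bytes)
        x = self.embed(byte_ids)                          # (batch, n_bytes, dim)
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)  # (batch, n_bytes/downsample, dim)
        return x

tokenizer = SoftByteTokenizer()
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

byte_ids = torch.randint(0, 256, (1, 64))          # dummy batch of byte IDs
out = encoder(tokenizer(byte_ids))                 # tokenizer output feeds the encoder
loss = out.sum()
loss.backward()                                    # gradients reach the tokenizer too
print(tokenizer.embed.weight.grad is not None)     # True
```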

Fastformers, or similar architectures, scale linearly with the input length, which makes translating long untokenized character sequences more viable.
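Reusing the made-up lengths from the earlier comparison, a linear-attention model closes most of that gap:

```python
subword_len, char_len = 25, 120          # same assumed lengths as above
print(char_len / subword_len)            # ~4.8x more work with linear attention
print(char_len ** 2 / subword_len ** 2)  # ~23x with quadratic attention
```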