Using SentencePiece/Byte Pair Encoding on a Model

Hi Yasmin, thanks for posting the configuration. It’s good to note that you are using the sentencepiece transform whereas I am using onmt_tokenize. You are correct in pointing out that I am using a simple vanilla NMT implementation. At present I am trying to replicate the results of a paper that uses a vanilla approach, so I am restricted to using simpler models for that particular experiment. However, I might move on from that and concentrate on a Transformer architecture next. Using Transformers, I have also seen a significant improvement in BLEU scores. I have the Pro version of Colab, which is working great; the standard version is too slow for building these types of models.

Regards,

Séamus.

Hello, I’m going to jump into this conversation.
I am training in a low-resource setting; here are some things I noticed that could be beneficial.

  1. Using the relative Transformer model (TransformerRelative) instead of the standard Transformer. (+1 BLEU)
  2. Word dictionary. (+1 BLEU)
  3. Keep long sentences and paragraphs. (+6 BLEU)
  4. Back translation. (+1 BLEU)
  5. Copy monolingual target text to source.
  6. Tagged features (e.g. <ru> <bt> <v2> ▁I ▁love ▁music .)
  7. Average the last two models, then continue training with the averaged model. (+0.5 BLEU)
  8. Training several SentencePiece variants for a single NMT model, with a shared vocab built by adding all the SentencePiece tokens of all SP versions, on both the source and target side, into one vocab file. (+1 BLEU) See the training sketch after this list.
    (e.g.)
    v1 vocab size:28500, maximum piece length: 14
    v2 vocab size:25000, maximum piece length: 12
    v3 vocab size:22000, maximum piece length: 10
    v4 vocab size:19000, maximum piece length: 8
    v5 vocab size:16000, maximum piece length: 6
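
For anyone who wants to try point 8, here is a minimal sketch of how the five SentencePiece variants could be trained with the Python API. The corpus path, file prefixes and model type are assumptions for illustration, not Nart's actual setup:

    import sentencepiece as spm

    TRAIN_TEXT = "train.txt"  # assumed training corpus, one sentence per line

    # (prefix, vocab_size, max piece length) pairs taken from the list above
    VARIANTS = [
        ("v1", 28500, 14),
        ("v2", 25000, 12),
        ("v3", 22000, 10),
        ("v4", 19000, 8),
        ("v5", 16000, 6),
    ]

    for prefix, vocab_size, max_len in VARIANTS:
        spm.SentencePieceTrainer.train(
            input=TRAIN_TEXT,
            model_prefix=prefix,
            vocab_size=vocab_size,
            max_sentencepiece_length=max_len,  # caps the piece length per variant
            model_type="unigram",              # assumed; "bpe" works the same way
        )

Each variant then segments the same training data differently, and the differently segmented corpora are combined to train a single NMT model.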

If something needs to be clearer, I can share further details.

4 Likes

Dear Nart,

Many thanks for sharing your tips when it comes to low-resource languages! Sure, changing BPE settings can help in some cases.

I would be grateful if you can share more details about the first point, using the “Relative Transformer”. What are the different parameters between it and the standard “Transformer”?

Many thanks!
Yasmin

@ymoslem I’m using the relative Transformer model that comes out of the box with OpenNMT-tf:

onmt-main --model_type TransformerRelative --config data.yml --auto_config train

The implementation seems to correspond to this paper: [1803.02155] Self-Attention with Relative Position Representations
It’s the last item in this list: Model — OpenNMT-tf 2.16.0 documentation

1 Like

Many thanks, Nart, for the detailed answer!

My understanding is that in OpenNMT-py position_encoding: 'true' achieves the same purpose. Could you please confirm or correct me, François? Thanks! @francoishernandez

Kind regards,
Yasmin

1 Like

Positional encoding and relative position representations are not the same thing.
Positional encoding, or position_encoding, is part of the original Transformer architecture (section 3.5 of “Attention Is All You Need”).
Relative position representations were introduced a bit later, in the paper that @Nart mentions. They were implemented in OpenNMT-py and can be enabled by setting max_relative_positions.
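
For intuition, relative position representations index a learned embedding table by the clipped offset between each pair of token positions, rather than adding one absolute position signal to the input. A small illustrative sketch of that indexing (not OpenNMT code; the clipping window k plays the role of max_relative_positions):

    import numpy as np

    def relative_position_index(length, k):
        # Pairwise offsets j - i, clipped to [-k, k] and shifted to [0, 2k]
        # so they can index an embedding table of size 2k + 1.
        pos = np.arange(length)
        offsets = pos[None, :] - pos[:, None]  # shape (length, length)
        return np.clip(offsets, -k, k) + k

    print(relative_position_index(5, 2))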

2 Likes

Some interesting tips there :-). I am also doing some “low-resource” work.

1 Like

I am curious, what language are you working on?

I am not sure why I can’t update my previous post anymore!
Back translation, first iteration. (+4.5 BLEU)

Hi @Nart,

Many thanks for sharing your insight and hints.

I’m particularly curious about point 8, training several SentencePiece variants with a shared vocab.

Is there a paper describing this technique? Wouldn’t the final vocabulary be too big for a low-resource setting (after adding all tokens for both languages)? I guess the effect should be more or less similar to applying BPE dropout; have you tried that too?

My answers are all speculative, no concrete analysis or experiments yet.

I have not tried it. From the paper’s description, the dropout seems to be applied randomly, so the model is affected by both the bad and the good signal. In my case, the BPE segmentation is still unique to each model, but the model is aware of other possible segmentations.

The vocab is slightly bigger than it would have been if I had only used the largest model (all models combined: 62,681 vs. 57,000 for the largest model alone). Afterwards, you can pick the model with the best BLEU score and reduce the vocab size just for that model (vocab_update).
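
A minimal sketch of how such a shared vocab could be assembled from the individual SentencePiece vocabularies (file names are placeholders; SentencePiece .vocab files have one token<TAB>score entry per line, and the output format may need adjusting to whatever your NMT toolkit expects):

    # Union of the piece inventories of all SentencePiece variants,
    # written one token per line for use as a single shared vocab file.
    vocab_files = ["v1.vocab", "v2.vocab", "v3.vocab", "v4.vocab", "v5.vocab"]

    merged = {}
    for path in vocab_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                token = line.rstrip("\n").split("\t")[0]
                merged.setdefault(token, None)  # keeps first-seen order, drops duplicates

    with open("shared-vocab.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(merged) + "\n")

    print(f"merged vocab size: {len(merged)}")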

This is based on my own experiments, no papers published.

Hi @Nart, Just spotted your question as I didn’t get a notification. I’m focussing on Tagalog (Filipino)-English.

How is your progress? What strategy are you using?

I’ve described this work at some length in various posts in the Facebook Machine Translation Group. It started with SMT, then various small NMT models, then back-translation, then a Transformer model.

1 Like

I have some questions regarding the way you determine the BLEU score.

  1. Did you check all of these individually? Could there be a chance that keeping long sentences and paragraphs considerably reduces the gains from the other ones?

  2. Was your BLEU score determined on a completely independent test set with the same kind of data? I’m asking because I’m doing some similar tests and I use both the paragraphs and the sentences, plus the chunked sentences. I wasn’t sure of the best way to build my script to guarantee that my test set is completely independent and yet not throw away any data. As I’m doing lots of filtering on the data, if I apply these filters at the paragraph level… I will lose lots of data!
    Another challenge is when I include full sentences and chunks of sentences… I need to compare my BLEU score on the same kind of data as before splitting them. You usually get a much better score when the sentences are smaller, which wouldn’t be a fair comparison.

I just did a test where I keep long sentences and, when source and target have exactly the same punctuation (.!?,:;), I also break those long sentences into small chunks based on the punctuation. The results seem to be significantly better, but for the reasons mentioned above I did not put much time into figuring out how to compare the BLEU scores…
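
For what it’s worth, here is a rough sketch of that splitting rule as I read it (my own illustration, not Samuel’s actual script): a pair is only split when both sides contain exactly the same sequence of punctuation marks, so the resulting chunks stay aligned.

    import re

    PUNCT = r"[.!?,:;]"

    def chunk_pair(src, tgt):
        # Only split when source and target share exactly the same punctuation sequence.
        if re.findall(PUNCT, src) != re.findall(PUNCT, tgt):
            return [(src, tgt)]
        src_chunks = re.split(rf"(?<={PUNCT})\s+", src)  # split after punctuation + space
        tgt_chunks = re.split(rf"(?<={PUNCT})\s+", tgt)
        if len(src_chunks) != len(tgt_chunks):
            return [(src, tgt)]
        return list(zip(src_chunks, tgt_chunks))

    print(chunk_pair("Hello, world! How are you?",
                     "Bonjour, le monde! Comment ça va?"))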

You did a good job too; I enjoyed the results 🙂

Hello @SamuelLacombe

I have not analyzed those independently, this definitely needs fine tuning and individual testing to figure out what works best.

Yes, the testing data is not part of the training data, but it’s in-domain.

True.

I get the BLEU scores on the validation/test data for each model (5 SentencePiece versions in a single NMT model). BLEU seems to be a relative metric, but it can be a good indicator of whether the models are heading in the right direction.

BLEU-0: Tokenized validation data
BLEU-1: Detokenized validation data
BLEU-2: Tokenized test data
BLEU-3: Detokenized test data

28500 v6/src-vocab.txt
BLEU-0 = 34.76, 61.1/41.1/31.6/25.8 (BP=0.919, ratio=0.922, hyp_len=57572, ref_len=62445)
BLEU-1 = 27.46, 53.0/33.0/23.3/17.7 (BP=0.942, ratio=0.944, hyp_len=37347, ref_len=39568)
BLEU-2 = 36.32, 62.8/43.3/33.4/27.6 (BP=0.913, ratio=0.916, hyp_len=30255, ref_len=33015)
BLEU-3 = 29.16, 55.1/35.0/25.2/19.6 (BP=0.934, ratio=0.936, hyp_len=19841, ref_len=21200)

25000 v7/src-vocab.txt
BLEU-0 = 35.27, 61.2/41.6/32.2/26.3 (BP=0.920, ratio=0.923, hyp_len=59325, ref_len=64243)
BLEU-1 = 27.45, 52.9/32.8/23.3/17.8 (BP=0.943, ratio=0.945, hyp_len=37372, ref_len=39568)
BLEU-2 = 36.37, 62.4/43.2/33.5/27.6 (BP=0.916, ratio=0.919, hyp_len=31232, ref_len=33973)
BLEU-3 = 28.82, 54.6/34.7/24.9/19.3 (BP=0.933, ratio=0.935, hyp_len=19822, ref_len=21200)

22000 v8/src-vocab.txt
BLEU-0 = 36.14, 61.6/42.7/33.0/26.9 (BP=0.925, ratio=0.927, hyp_len=62436, ref_len=67324)
BLEU-1 = 27.20, 52.9/32.8/23.1/17.5 (BP=0.940, ratio=0.942, hyp_len=37255, ref_len=39568)
BLEU-2 = 36.61, 62.4/43.5/33.6/27.4 (BP=0.921, ratio=0.924, hyp_len=32956, ref_len=35675)
BLEU-3 = 27.98, 54.0/33.8/23.9/18.2 (BP=0.937, ratio=0.939, hyp_len=19909, ref_len=21200)

19000 v9/src-vocab.txt
BLEU-0 = 37.18, 61.7/43.6/33.8/27.5 (BP=0.935, ratio=0.937, hyp_len=68244, ref_len=72811)
BLEU-1 = 26.67, 52.3/32.0/22.4/16.8 (BP=0.946, ratio=0.948, hyp_len=37501, ref_len=39568)
BLEU-2 = 37.92, 62.6/44.7/34.7/28.3 (BP=0.931, ratio=0.934, hyp_len=36004, ref_len=38564)
BLEU-3 = 27.57, 53.6/33.3/23.5/17.7 (BP=0.938, ratio=0.940, hyp_len=19934, ref_len=21200)

16000 v10/src-vocab.txt
BLEU-0 = 38.48, 61.8/44.8/35.0/28.3 (BP=0.946, ratio=0.947, hyp_len=79756, ref_len=84217)
BLEU-1 = 24.92, 50.8/30.3/20.6/15.1 (BP=0.947, ratio=0.949, hyp_len=37537, ref_len=39568)
BLEU-2 = 39.34, 63.1/46.2/36.2/29.3 (BP=0.939, ratio=0.941, hyp_len=42260, ref_len=44924)
BLEU-3 = 25.53, 52.1/31.4/21.3/15.6 (BP=0.939, ratio=0.941, hyp_len=19948, ref_len=21200)

A couple of questions:

  1. What do the transforms in data do? I see you have mentioned sentencepiece, so during onmt_build_vocab does it automatically use SentencePiece to generate the source and target vocab?
  2. Regarding training time:
    My config/parameters are:
    Training data - 2.5 Million Sentence Pairs
    Transformer architecture
    Training on 2 GPUs (3080s)
    Layers:4
    Heads:6
    Batch_type: tokens
    Batch_size: 4096
    Train_steps: 400000
    Valid_steps: 1000
    Src vocab size: 54k
    Tgt vocab size: 61k

The number of parameters comes out to be ~67M
During training I’m getting 16-18k source tokens/sec and 18-20k target tokens/sec
Also, along with this I’m getting a value in seconds, which I believe is the total training time elapsed. I’m using this to estimate the training time.
For 13k steps it took 4.34 hours, which seems like a lot to me, as 400k steps would then take approximately 6 days.
Does this seem appropriate to you, or am I doing something wrong here?
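
A quick sanity check of that extrapolation (simple linear scaling, ignoring warmup and validation/checkpoint overhead):

    hours_so_far, steps_so_far, target_steps = 4.34, 13_000, 400_000
    estimated_hours = hours_so_far / steps_so_far * target_steps
    print(f"~{estimated_hours:.0f} hours, i.e. ~{estimated_hours / 24:.1f} days")  # ~134 hours, ~5.6 days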

Also, at 13k steps it has only loaded up to weighted corpus 8 (I am not sure about the total number, but I would expect it to be more than 500, if not in the thousands) and achieved a training accuracy of 54.73, ppl: 6.03, xent: 1.80, and a validation accuracy of 56.8, ppl: 9.25.
The accuracy seems weirdly low to me, even though it has only seen a very small part of the dataset. Does this seem good to you?

Any help/feedback would be highly appreciated! It would be a bummer to find out I was doing something wrong after the training has finished, since it takes about a whole week.

Dear Gurjot,

Transforms run some preprocessing on the fly, i.e. during training. In this case, the SentencePiece transform uses the provided SentencePiece model and sub-words the training data at training time. Hence, you have to create this SentencePiece model and provide its path in the config file.

So the only step that the transform does is sub-wording using this model.

I feel that training for 400,000 steps on only 2.5 million sentences is too much training. You can use early stopping (e.g. early_stopping: 6).

All the best,
Yasmin

1 Like

Hello Yasmin, thanks for the quick reply!
I really appreciate it. I have some more questions that I would like to ask.

1.) If the vocab size is limited to 50k (let’s say using the min frequency parameter), what will happen to words that are not present in the vocab but are present in the training sentences?
Will the model learn about these words, given that there are instances of them in the training sentences?
What if these words are seen in the test data again; will they result in unk tokens?
Is SentencePiece the only viable practical option here?

2.) I tried the sentencepiece and filtertoolong params in the data field that you had mentioned, but that resulted in a TypeError: not a string. Can you explain this in a little more detail, with some guide if possible? I would be really thankful!

3.) For 2.5M sentences, if I use a batch size of 16, then 2,500,000 / 16 = 156,250 steps for the model to train on the whole training data.
So if I use 400k training steps, that means ~2.5 epochs on the whole training data (2.5 million sentences)?
Is the math correct?

4.) How would I be able to calculate the above figures if the batch type is tokens? Let’s say the batch size is 4096.
I think calculating them exactly might not be possible; perhaps estimating them?

5.) I had trained a word2vec model on Punjabi with ~35 million sentences, which I am using as the target embedding (instead of the much lighter fastText model I was using before), but this didn’t seem to affect MT accuracy; in fact, I had to reduce the batch size so that I don’t run out of VRAM with the embeddings loaded.
Is it because the Transformer also learns its own encoding of the text during training?
Or is this unexpected behaviour?

They will be considered UNKs.

Sub-wording helps reduce UNKs. The size of the data is also an important factor.

At translation time, a replace_unknowns option (e.g. in CTranslate2) can try to copy the source token into the target. If you train the SentencePiece model with the option --byte_fallback, this can improve the copying behaviour.
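
For illustration only, the two pieces mentioned above could look roughly like this (the paths and vocab size are placeholders, and the CTranslate2 model is assumed to have been converted from the trained checkpoint beforehand):

    import sentencepiece as spm
    import ctranslate2

    # 1) Train the SentencePiece model with byte fallback, so rare characters
    #    decompose into bytes instead of becoming unknown pieces.
    spm.SentencePieceTrainer.train(
        input="train.txt", model_prefix="spm", vocab_size=32000,
        byte_fallback=True,
    )

    # 2) At translation time, ask CTranslate2 to replace any remaining <unk>
    #    target tokens with the aligned source token.
    sp = spm.SentencePieceProcessor(model_file="spm.model")
    translator = ctranslate2.Translator("ct2_model_dir")
    tokens = sp.encode("Hello world", out_type=str)
    result = translator.translate_batch([tokens], replace_unknowns=True)
    print(sp.decode(result[0].hypotheses[0]))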

Then, I would suggest you stick with manual sub-wording for now, i.e. fully preparing your data and sub-wording it with SentencePiece before using it in OpenNMT-py, and removing the sentencepiece transform option.

It would be too slow. If you have batch_type: tokens, you should try batch_size: 4096, 2048, or 1024.

This depends on whether you have batch_type: tokens or batch_type: examples. Also, please search the forum for “accum_count” mentions by @francoishernandez like this one.
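
A back-of-the-envelope way to estimate it when batch_type is tokens: the effective number of tokens per optimizer step is roughly batch_size × accum_count × number of GPUs, so you also need the total token count of the sub-worded training data. The numbers below are assumptions for illustration:

    total_train_tokens = 60_000_000  # assumed token count of the sub-worded corpus
    batch_size = 4096                # tokens, as in the config above
    accum_count = 4                  # assumed gradient accumulation
    num_gpus = 2

    tokens_per_step = batch_size * accum_count * num_gpus
    steps_per_epoch = total_train_tokens / tokens_per_step
    print(f"~{steps_per_epoch:.0f} steps per epoch, "
          f"~{400_000 / steps_per_epoch:.0f} epochs in 400k steps")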

When I was working on Hindi, I was told that using external embeddings would not have much effect, so I did not try it myself. What can help more is using back-translation, as illustrated here.

I hope this answers your questions. If you have more questions, please start a new topic for them; this would give your questions more exposure and allow others to give you their input as well.

All the best,
Yasmin

2 Likes