Using SentencePiece/Byte Pair Encoding on a Model

Dear Gurjot,

Transforms run some preprocessing on the fly, i.e. during training. In this case, the SentencePiece transform uses the provided SentencePiece model to sub-word the training data at training time. Hence, you have to create this SentencePiece model first and provide its path in the config file.

So the only step the transform performs is sub-wording with this model.

I feel that training for 400,000 steps on only 2.5 million sentences is too much. You can use early stopping (e.g. early_stopping: 6).
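
As a minimal sketch of that first step, assuming a raw training file named train.src and the usual transform config keys such as src_subword_model (check the OpenNMT-py docs for your version), the SentencePiece model could be trained with the Python API like this:

```python
import sentencepiece as spm

# Train a SentencePiece model on the raw (not yet sub-worded) source corpus.
# File names and vocab size are placeholders; adjust them to your data.
spm.SentencePieceTrainer.train(
    input="train.src",         # raw source text, one sentence per line
    model_prefix="spm.src",    # produces spm.src.model and spm.src.vocab
    vocab_size=50000,
    character_coverage=1.0,    # keep full character coverage for Indic scripts
    model_type="unigram",      # or "bpe"
)

# spm.src.model is the path you then point to from the config
# (e.g. src_subword_model when using the sentencepiece transform);
# repeat the same for the target side.
```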

All the best,
Yasmin


Hello Yasmin, thanks for the quick reply!
I really appreciate it. I have some more questions that I would like to ask.

1.) If the vocab size is limited to 50k (let's say using the min frequency parameter), what will happen to words that are not present in the vocab but are present in the training sentences?
Will the model learn about these words, since there are instances of them in the training sentences?
What if these words are seen in the test data again; will that result in unk tokens?
Is SentencePiece the only viable practical option here?

2.) I tried the sentencepiece and filtertoolong params in the data field that you had mentioned, but that resulted in a TypeError: not a string. Can you explain this in a little more detail, with a guide if possible? I would be really thankful.

3.) For 2.5M sentences, if I use a batch size of 16, then 2,500,000 / 16 = 156,250 steps for the model to train on the whole training data.
So if I use 400k training steps, does that mean ~2.5 epochs on the whole training data (2.5 million sentences)?
Is the math correct?

4.) How would I be able to calculate the above figures if the batch type is tokens? Let's say the batch size is 4096.
I think calculating these exactly might not be possible; perhaps only estimating them?

5.) I had trained a word2vec model on Punjabi with ~35 million sentences, which I am using as the target embedding, but this didn't seem to affect MT accuracy (compared to the fastText model I was using before, which is much lighter). In fact, I had to reduce the batch size so that I don't run out of VRAM with the embeddings loaded.
Is it because the Transformer also creates its own encodings of the text while training?
Or is this unexpected behaviour?

Words that are not in the vocab will be considered UNKs.

Sub-wording helps reduce UNKs. The size of the data is also an important factor.
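
To illustrate why, here is a small sketch, assuming you already have a trained SentencePiece model at spm.src.model: a word missing from a word-level vocab is still segmented into known pieces instead of becoming a single unknown token.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.src.model")  # path to the trained model (assumed)

# A rare or unseen word is split into smaller pieces that are in the
# SentencePiece vocab, so the model does not have to emit <unk> for it.
print(sp.encode_as_pieces("ਚੰਡੀਗੜ੍ਹ"))  # the exact segmentation depends on your model
```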

At translation time, the replace_unknowns option (e.g. in CTranslate2) can try to copy the source token into the target. If you train the SentencePiece model with the option --byte_fallback, this can improve the copying behaviour.
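
A hedged sketch of both pieces, assuming a CTranslate2-converted model in ct2_model/ and input tokens that are already sub-worded (the paths and the example tokens are placeholders):

```python
import sentencepiece as spm
import ctranslate2

# 1) Train the SentencePiece model with byte fallback, so characters that
#    are not in the vocab decompose into byte pieces instead of <unk>.
spm.SentencePieceTrainer.train(
    input="train.src",
    model_prefix="spm.src",
    vocab_size=50000,
    byte_fallback=True,
)

# 2) At translation time, ask CTranslate2 to copy the aligned source token
#    whenever the model would otherwise output an unknown target token.
translator = ctranslate2.Translator("ct2_model", device="cpu")
results = translator.translate_batch(
    [["▁This", "▁is", "▁an", "▁example"]],  # placeholder, already sub-worded
    replace_unknowns=True,
)
print(results[0].hypotheses[0])
```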

Then, I would suggest you stick with manual sub-wording for now, i.e. fully preparing your data and sub-wording it with SentencePiece before using it in OpenNMT-py, and removing the sentencepiece option from the config.
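
A minimal sketch of that manual route (file names are placeholders): sub-word the corpus once, then list the resulting files as training data in the config.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("spm.src.model")

# Sub-word the raw source corpus once; the output file is what you then
# declare as training data in the OpenNMT-py config, without the
# sentencepiece transform. Repeat with the target model for train.tgt.
with open("train.src", encoding="utf-8") as fin, \
     open("train.src.sp", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode_as_pieces(line.strip())) + "\n")
```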

A batch size of 16 would be too slow. If you have batch_type: tokens, you should try batch_size: 4096, 2048, or 1024.

This depends on whether you have batch_type: tokens or batch_type: examples. Also, please search the forum for “accum_count” mentions by @francoishernandez, like this one.
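
For a rough answer to questions 3 and 4, here is a back-of-the-envelope sketch. The average sentence length and the accum_count value are assumptions to replace with your own numbers; the effective batch per optimizer step is roughly batch_size × accum_count × number of GPUs, so this only gives an order of magnitude.

```python
# Rough estimate of epochs covered by a training-step budget
# when batch_type is tokens. All averages below are assumptions.
num_sentences = 2_500_000
avg_tokens_per_sentence = 25        # assumption: measure it on your sub-worded data
total_tokens = num_sentences * avg_tokens_per_sentence

batch_size_tokens = 4096            # batch_size with batch_type: tokens
accum_count = 4                     # gradient accumulation (example value)
num_gpus = 1

# Tokens consumed per optimizer step.
tokens_per_step = batch_size_tokens * accum_count * num_gpus

steps_per_epoch = total_tokens / tokens_per_step      # ~3,815 with these numbers
epochs_for_400k_steps = 400_000 / steps_per_epoch     # ~105 epochs
print(round(steps_per_epoch), round(epochs_for_400k_steps))
```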

When I was working on Hindi, I was told that using external embeddings would not have much effect, so I did not try it myself. What can help more is using back-translation, as illustrated here.

I hope this answers your questions. If you have more questions, please start a new topic for them; this would give your questions more exposure and allow others to give you their input as well.

All the best,
Yasmin
