I realised that I don’t quite understand how token batching differs from conventional batching. Typically, whole sentences are grouped together to form a batch. But with token batching, e.g. a batch size of 8 tokens, how long are the token sequences that get batched together?
Also, why is token batching used or recommended? I think that by defining a batch in tokens we could be splitting a sentence up, which would break the connections between the words in it. Wouldn’t that stop the language model from learning optimally?
I believe token batching doesn’t split up the sentences themselves. It just tries to find the number of tokens closest to the token batch size you set that is also a multiple of 8 (traditionally).
Token batching is ‘better’ in this case because it keeps batch sizes more uniform. A batch size of 128 sentences takes 128 sentences regardless of their length, so the number of tokens per batch can vary wildly. Token batching fixes that.
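To make the difference concrete, here is a minimal sketch in plain Python. It is not the actual OpenNMT-py implementation: the function names are made up for illustration, the token budget is assumed to count padding (number of sentences times the longest sentence in the batch), and the multiple-of-8 adjustment is left out.

```python
from typing import List

Sentence = List[str]


def batch_by_sentences(sents: List[Sentence], batch_size: int) -> List[List[Sentence]]:
    """Conventional batching: a fixed number of sentences per batch."""
    return [sents[i:i + batch_size] for i in range(0, len(sents), batch_size)]


def batch_by_tokens(sents: List[Sentence], max_tokens: int) -> List[List[Sentence]]:
    """Token batching (greedy sketch): add whole sentences until the padded
    size of the batch would exceed max_tokens. Sentences are never split."""
    batches: List[List[Sentence]] = []
    current: List[Sentence] = []
    longest = 0
    for sent in sents:
        new_longest = max(longest, len(sent))
        # Assumed cost model: sentences in the batch * longest sentence.
        if current and new_longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, new_longest = [], len(sent)
        current.append(sent)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


def token_counts(batches: List[List[Sentence]]) -> List[int]:
    """Number of real (unpadded) tokens in each batch."""
    return [sum(len(s) for s in batch) for batch in batches]


# Toy corpus: four short sentences followed by four long ones.
lengths = [2, 3, 2, 3, 12, 10, 11, 9]
sentences = [["w"] * n for n in lengths]

by_sentence = batch_by_sentences(sentences, batch_size=4)
by_token = batch_by_tokens(sentences, max_tokens=24)

print(token_counts(by_sentence))  # [10, 42]
print(token_counts(by_token))     # [10, 22, 20]
```

With 4 sentences per batch, one batch holds 10 tokens and the other 42. With a 24-token budget, the batches hold 4, 2 and 2 whole sentences, and the token counts stay much closer together.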
Hi Vincent, thank you so much for the tutorial!
I’m trying to fine-tune NLLB-200 3.3B using LoRA. The training works, but when I try to translate some simple sentences I get “” for all of them.
These are my config files for training and inference:
training
When you start logging the ACC/PPL (don’t wait 65000 steps), check that ACC is already very high and PPL low.
If in doubt, post the log of the first 2000 steps here.
Hi Vincent,
Thanks a lot for the great work. I need to fine-tune the model from EN to multiple languages, e.g. ZH, FR, DE and PT, in a specific domain. Can I list all the language pairs in the nllb-train.yaml file (see below)? Before training, do I need to pre-process the data in any way (such as applying the sentencepiece model to the corpus)?
enzh:
    path_src: "/en-zh/train.en"
    path_tgt: "/en-zh/train.zh"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "zho_Hans"
    src_suffix: ""
    tgt_suffix: ""
enfr:
    path_src: "/en-fr/train.en"
    path_tgt: "/en-fr/train.fr"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "fra_Latn"
    src_suffix: ""
    tgt_suffix: ""
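If every pair follows the same pattern as the two entries above, the per-pair blocks could be generated rather than typed by hand. The following is only a sketch under that assumption: the helper names are made up, the DE and PT paths are guessed from the EN-ZH and EN-FR ones, the language codes follow the FLORES-200 naming used by NLLB, and the prefix/suffix values are copied as they appear in the post. It does not settle whether this config trains correctly.

```python
# Hypothetical target languages: file suffix -> NLLB (FLORES-200) language code.
PAIRS = {
    "zh": "zho_Hans",
    "fr": "fra_Latn",
    "de": "deu_Latn",
    "pt": "por_Latn",
}


def build_data_section(pairs, weight=10):
    """Build one corpus entry per EN->X pair, mirroring the entries in the post."""
    data = {}
    for lang, code in pairs.items():
        data[f"en{lang}"] = {
            # Paths are assumptions that follow the "/en-xx/train.xx" pattern.
            "path_src": f"/en-{lang}/train.en",
            "path_tgt": f"/en-{lang}/train.{lang}",
            "transforms": ["sentencepiece", "prefix", "suffix", "filtertoolong"],
            "weight": weight,
            "src_prefix": "eng_Latn",
            "tgt_prefix": code,
            "src_suffix": "",
            "tgt_suffix": "",
        }
    return data


def to_yaml(data, indent=4):
    """Render the nested dict as indented YAML text (no external library needed)."""
    lines = []
    for name, fields in data.items():
        lines.append(f"{name}:")
        for key, value in fields.items():
            if isinstance(value, list):
                rendered = "[" + ", ".join(value) + "]"
            elif isinstance(value, str):
                rendered = f'"{value}"'
            else:
                rendered = str(value)
            lines.append(" " * indent + f"{key}: {rendered}")
    return "\n".join(lines)


data_section = build_data_section(PAIRS)
print(to_yaml(data_section))
```

The printed text can be pasted under the corpus section of the training config; adjust the paths and weights to match your own data.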