Thank you so much! It worked! How can I retrieve that kind of information about the model architecture if I need to fine-tune another model checkpoint in the future?
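For future reference, one way is to inspect the converted checkpoint directly with PyTorch. A minimal sketch, assuming the checkpoint follows the usual OpenNMT-py layout (a dict with 'opt' and 'model' keys; the file name here is hypothetical, and option names can vary slightly between OpenNMT-py versions):

import torch

# Load the converted NLLB checkpoint on CPU (file name is hypothetical).
ckpt = torch.load("nllb-200-600M-onmt.pt", map_location="cpu")

# OpenNMT-py checkpoints are plain dicts; the training options used to build
# the model are usually stored under the 'opt' key.
opt = ckpt.get("opt")
if opt is not None:
    for name in ("enc_layers", "dec_layers", "hidden_size", "heads", "transformer_ff"):
        print(name, getattr(opt, name, "not found"))

# You can also infer sizes directly from the weight shapes.
for key, tensor in list(ckpt["model"].items())[:5]:
    print(key, tuple(tensor.shape))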
Now I get another error message
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d ')
That seems to be because I was loading too much data at once on my Colab notebook, so I think I'll reduce the amount of data I use (600K) to around 300K.
I saw that you used 341K lines, so I tried using 300K lines of training data and subsequently 10 lines of training data. However, I was still thrown the same error, even after reducing the batch size to 1. I am using Google Colab, which provides around 12GB system RAM and 15GB GPU RAM.
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d ')
My Google search attributes the error message to running out of memory (the system RAM seems to be the problem, not the GPU RAM), but I'm not sure what else I can change. Currently my batching and optimisation configuration is as follows:
I realised that I don't quite understand what token batching is compared with the more conventional batching. Typically we have sentences batched together to form a batch. But for token batching, e.g. a batch size of 8 tokens, what is the length of the sequences of tokens that are batched together?
Also, why is token batching used/recommended? I don't quite understand why, since by defining a batch in tokens we could be splitting a sentence up, and hence the connections between words in a sentence may be broken, so the language model would not be able to learn optimally.
Token batching doesn't split up the sentences themselves, I believe. It just tries to fill each batch with the number of tokens closest to the token batch size you set, which is traditionally also a multiple of 8.
Token batching is 'better' in this case because it keeps the size of batches more uniform. A batch size of 128 sentences would take 128 sentences regardless of their length, so the number of tokens in each batch can easily be wildly different. Token batching fixes that.
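For intuition, here is a minimal sketch of token-based batching over whole sentences (a simplification of what the toolkit actually does; len(sent.split()) stands in for the real subword tokenizer):

def batch_by_tokens(sentences, max_tokens=4096):
    # Group whole sentences so each batch holds roughly max_tokens tokens.
    batch, n_tokens = [], 0
    for sent in sentences:
        n = len(sent.split())  # stand-in for the real subword tokenizer
        if batch and n_tokens + n > max_tokens:
            yield batch
            batch, n_tokens = [], 0
        batch.append(sent)
        n_tokens += n
    if batch:
        yield batch

Sentences stay intact; only the number of sentences per batch varies so that the token count per batch stays roughly constant.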
Hi Vincent, thank you so much for the tutorial!
I'm trying to fine-tune NLLB-200 3.3B using LoRA, and the training works, but when I try to translate some simple sentences I get "" (empty output) for all of them.
These are my config files for training and inference:
training
When you start logging the ACC/PPL (don't wait 65000 steps), check whether ACC is already very high and PPL is low.
If in doubt, post the log of the first 2000 steps here.
Hi Vincent,
Thanks a lot for the great work. I need to fine-tune the model from EN to multiple languages, e.g. ZH, FR, DE and PT, in a specific domain. Can I list all the language pairs in the nllb-train.yaml file (see below)? Before training, do I need to do any preprocessing of the data (such as applying the SentencePiece model to the corpus)?
enzh:
    path_src: "/en-zh/train.en"
    path_tgt: "/en-zh/train.zh"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "zho_Hans"
    src_suffix: ""
    tgt_suffix: ""
enfr:
    path_src: "/en-fr/train.en"
    path_tgt: "/en-fr/train.fr"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "fra_Latn"
    src_suffix: ""
    tgt_suffix: ""
I downloaded the checkpoints and the spm model from the given links in this thread. I ran the same inference script given in this tutorial on English FLORES data. However, I get the same output (4096 characters) for all input sentences.
I tried other NLLB variants as well, but the output from every model is garbage. Similarly, thinking my data had some issue, I tried different files, but the outcome is the same.
opennmt-py version: 3.3.0
What could be the reason?
Edit: I upgraded opennmt-py to the latest version (3.4.3) and the problem is gone.
Hello, I am a newbie and I want to fine-tune the NLLB-200 model on a new language. I have looked at the OpenNMT-py project, but I am not sure how to use it. Also, is the extended vocabulary separate from the OpenNMT project? Looking forward to your reply!
Hi, thanks for sharing! Has anyone else run into the catastrophic forgetting problem?
As a first step, I fine-tuned en-zh with 1M sentences, and the en-zh BLEU score improved a lot; however, other languages dropped a lot too, and some of them output very strange words in other languages.
After that, I fine-tuned nllb-600M on multiple languages: en-zh, en-fr, en-ar …
It still ran into the same problem. Do you have any suggestions?
Thank you!
Yeah, if you expect the model to retain its performance on certain pairs, you need to include them in your fine-tuning data. Not to the same extent as what you're mainly training on, but keeping a small subset for the language pairs whose quality you want to keep roughly the same will help.
Unfortunately, improving one language pair usually comes at the cost of others unless you balance the data across all the ones you care about. Pairs you don't add data for will be catastrophically forgotten.
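A rough sketch of what that could look like in the data section of the training config, following the same layout as the config earlier in this thread (the en-fr "keep" corpus and its paths are hypothetical; the weights just bias sampling towards the pair you mainly care about while keeping a trickle of data for the pair you want to preserve):

enzh:
    path_src: "/en-zh/train.en"
    path_tgt: "/en-zh/train.zh"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "zho_Hans"
    src_suffix: ""
    tgt_suffix: ""
enfr_keep:
    path_src: "/en-fr/keep.en"
    path_tgt: "/en-fr/keep.fr"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 1
    src_prefix: "eng_Latn"
    tgt_prefix: "fra_Latn"
    src_suffix: ""
    tgt_suffix: ""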