When I fine-tune the 3.3B or 1.3B model (in a notebook on a cloud GPU), I get the error below:
File "/workspace/OpenNMT-py/onmt/train_single.py", line 165, in main
model = build_model(model_opt, opt, vocabs, checkpoint)
File "/workspace/OpenNMT-py/onmt/model_builder.py", line 412, in build_model
model.load_state_dict(
File "/workspace/OpenNMT-py/onmt/models/model.py", line 142, in load_state_dict
raise ValueError(
ValueError: Extra keys in model state_dict do not match the model config dict_keys
Thank you for the tutorial. I am using the nllb-200-600M-onmt.pt checkpoint and have followed every detail in your tutorial, but when I try to fine-tune the model, I get this error in Colab: AssertionError: An error in model’s partition and checkpoint’s slice was detected
Is there something I need to change in my train.yml file?
Does it have to do with the fact that I used a different checkpoint compared to your tutorial? I’m also not sure if my vocab size is correct but this was the output from the code to modify the SentencePiece model.
Thank you so much! It worked! How may I retrieve such information about the model architecture if I have to finetune another model checkpoint in the future?
Now I get another error message
(/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ')
That seems to be because I was loading too much data at once in my Colab notebook, so I think I’ll reduce the amount of data that I use (600K lines) to around 300K.
I saw that you used 341K lines, so I tried using 300K lines of training data and subsequently 10 lines of training data. However, I still got the same error, even after reducing the batch size to 1. I am using Google Colab, which provides around 12GB of system RAM and 15GB of GPU RAM.
(/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ')
My Google search attributes the error message to running out of memory (my system RAM seems to be the problem, not the GPU RAM), but I’m not sure what else I can change. Currently my batching and optimisation configuration is as follows:
I realised that I don’t quite understand how token batching differs from the more conventional batching. Typically, whole sentences are grouped together to form a batch. But with token batching, e.g. a token batch size of 8, what is the length of the token sequences that are batched together?
Also, why is token batching used/recommended? I think that by defining a batch by tokens, we could be splitting a sentence up, so the connections between the words in a sentence may be broken, and the language model would then not be able to learn optimally?
Token batching doesn’t split up the sentences themselves, I believe. It just tries to find the number of tokens closest to the token batch size you set that is also a multiple of 8 (traditionally).
Token batching is ‘better’ in this case because it keeps the size of batches more uniform. A batch size of 128 sentences would take 128 sentences, regardless of their length. So what can easily happen is that the number of tokens in each batch is wildly different. Token batching fixes that.
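To make the difference concrete, here is a minimal sketch in plain Python. It is not OpenNMT-py's actual batching code: the function names `sentence_batches`, `token_batches` and `token_counts` are made up for illustration, and a real toolkit typically also sorts by length and accounts for padding. It only shows the idea that token batching groups whole sentences up to a token budget and never cuts a sentence in half.

```python
from typing import List

Sentence = List[str]  # one sentence = a list of tokens


def sentence_batches(sentences: List[Sentence], batch_size: int) -> List[List[Sentence]]:
    """Conventional batching: a fixed number of sentences per batch."""
    return [sentences[i:i + batch_size] for i in range(0, len(sentences), batch_size)]


def token_batches(sentences: List[Sentence], max_tokens: int) -> List[List[Sentence]]:
    """Token batching (simplified): add whole sentences to a batch until the
    next sentence would push the batch over max_tokens. Sentences are never split;
    a sentence longer than max_tokens simply ends up in a batch of its own."""
    batches: List[List[Sentence]] = []
    current: List[Sentence] = []
    current_tokens = 0
    for sent in sentences:
        if current and current_tokens + len(sent) > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(sent)
        current_tokens += len(sent)
    if current:
        batches.append(current)
    return batches


def token_counts(batches: List[List[Sentence]]) -> List[int]:
    """Total number of tokens in each batch."""
    return [sum(len(sent) for sent in batch) for batch in batches]


# Toy corpus with sentence lengths 3, 12, 2, 9, 4 and 1 tokens.
corpus = [["a"] * 3, ["b"] * 12, ["c"] * 2, ["d"] * 9, ["e"] * 4, ["f"] * 1]

print(token_counts(sentence_batches(corpus, 2)))  # [15, 11, 5]
print(token_counts(token_batches(corpus, 16)))    # [15, 16]
```

With 2 sentences per batch, the token count per batch swings between 5 and 15. With a budget of 16 tokens, the batches hold 15 and 16 tokens, and every sentence is still intact.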
Hi Vincent, thank you so much for the tutorial!
I’m trying to fine-tune NLLB-200 3.3B using LoRA. The training works, but when I try to translate some simple sentences, I get “” for all of them.
These are my config files for training and inference:
training
When you start logging the ACC/PPL (don’t wait 65000 steps), check that ACC is already very high and PPL is low.
If in doubt, post the log of the first 2000 steps here.