No, I don’t think so, and as a matter of fact I need to modify the tutorial.
I will re-run it on my side and check what is going on.
EDIT:
So I did some tests. It just happens that those 26 tokens (besides the …) are not so common in the datasets I checked (cc-matrix, paracrawl, news-commentary). I was relying on a comment someone made on the fairseq repo, but those tokens do not seem to be so necessary after all.
However, I did the following:
Put the first 25 new tokens into a newtok.txt file (without the frequency).
Kept only the cc-matrix lines containing at least one of those tokens, which gives 341K lines out of 21M (see the sketch after this list).
Did the same for paracrawl.
Finetuned on the restricted data.
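For reference, the filtering step looks roughly like this (a minimal sketch, not the exact script I used; the corpus file names are placeholders and it assumes plain-text parallel files plus the newtok.txt from above):

```python
# Keep only the sentence pairs whose target side contains at least one
# of the new tokens listed in newtok.txt (one token per line, no frequency).
with open("newtok.txt", encoding="utf-8") as f:
    new_tokens = [t.strip() for t in f if t.strip()]

kept = 0
with open("cc-matrix.en", encoding="utf-8") as src_in, \
     open("cc-matrix.zh", encoding="utf-8") as tgt_in, \
     open("cc-matrix.restricted.en", "w", encoding="utf-8") as src_out, \
     open("cc-matrix.restricted.zh", "w", encoding="utf-8") as tgt_out:
    for src_line, tgt_line in zip(src_in, tgt_in):
        if any(tok in tgt_line for tok in new_tokens):
            src_out.write(src_line)
            tgt_out.write(tgt_line)
            kept += 1

print(f"kept {kept} sentence pairs")
```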
It does learn a few tokens (滩, 鸡, 《, 》); maybe it requires training longer. BLEU is 30.
But there are still a lot of “??”, which is the SentencePiece unknown token. I don’t speak Chinese, but if you can identify some missing characters, it may help further.
The whole procedure is fine, but learning new embeddings that actually impact the model obviously takes more training. Maybe the BLEU increase comes from the 《 and 》 characters; there seem to be a lot of them.
Thank you! Looks promising. We just need to determine the missing characters somehow.
I don’t speak Chinese either, but I asked ChatGPT to find the missing characters. For the first lines in newstest it gives:
起
衔
议
顾
商
I think it is possible to identify them all by asking ChatGPT about every line with “??”. But how do I filter out the ones already in the SentencePiece model?
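Would something like this be the right way to check against the SentencePiece model? (A minimal sketch; the model file name is the one from the NLLB release and is an assumption here, adjust to your local path.)

```python
import sentencepiece as spm

# Load the NLLB SentencePiece model (path assumed, adjust as needed).
sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_tokenizer_spm.model")

candidates = ["起", "衔", "议", "顾", "商"]
# piece_to_id() returns the unknown id for pieces not in the vocabulary,
# so anything mapping to unk_id() is a candidate token to add.
missing = [c for c in candidates if sp.piece_to_id(c) == sp.unk_id()]
print("not in the SentencePiece model:", missing)
```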
Just by comparing the reference translations from newstest2019 with the ones generated by NLLB, I got 446 missed characters. After filtering out the characters already in dictionary.txt, I got 245 missing characters. I guess there are more, but it’s a good starting point.
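A minimal sketch of that comparison (the newstest file names are placeholders and the actual script may differ; dictionary.txt is the one from the NLLB release):

```python
# Collect characters that appear in the reference but never in the NLLB
# output, then drop the ones already present in dictionary.txt.
with open("newstest2019-enzh.ref.zh", encoding="utf-8") as f:
    ref_chars = set("".join(f.read().split()))
with open("newstest2019-enzh.hyp.zh", encoding="utf-8") as f:
    hyp_chars = set("".join(f.read().split()))

missed = ref_chars - hyp_chars

# dictionary.txt holds one token per line (optionally followed by a count).
with open("dictionary.txt", encoding="utf-8") as f:
    vocab = {line.split()[0] for line in f if line.strip()}

still_missing = sorted(c for c in missed if c not in vocab)
print(len(still_missing), "characters not covered:", "".join(still_missing))
```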
I want to finetune the 600M NLLB model for a language not in NLLB. What do I need to change in the config? I tried the 1.3B but it failed with an OOM issue. What GPU memory size do I need to finetune the 1.3B?
Thank you! The 1.3B NLLB is working with LoRa, but I had to reduce the batch_size to 256 and there was still an OOM at some step. You used a batch_size of 384. Why 384? It doesn’t look like a random number.
[2023-05-18 08:51:11,186 INFO] Get prefix for cc-matrix-enzh: {'src': ' eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-18 08:51:11,186 INFO] Get prefix for src infer:
[2023-05-18 08:51:11,186 INFO] Get prefix for tgt infer:
[2023-05-18 08:51:11,186 INFO] Get suffix for cc-matrix-enzh: {'src': '', 'tgt': ''}
[2023-05-18 08:51:11,186 INFO] Get suffix for src infer:
[2023-05-18 08:51:11,186 INFO] Get suffix for tgt infer:
[2023-05-18 08:51:11,266 INFO] Get prefix for cc-matrix-enzh: {'src': ' eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-18 08:51:11,266 INFO] Get prefix for src infer:
[2023-05-18 08:51:11,266 INFO] Get prefix for tgt infer:
[2023-05-18 08:51:11,309 INFO] Starting training on GPU: [0]
[2023-05-18 08:51:11,309 INFO] Start training loop without validation...
[2023-05-18 08:51:11,309 INFO] Scoring with: TransformPipe()
[2023-05-18 08:52:43,343 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,394 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,436 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,479 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,522 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,564 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,603 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,646 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,690 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,735 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,777 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,821 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,863 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,906 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,947 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,987 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:44,027 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:53:40,481 INFO] Step 10/20000; acc: 87.1; ppl: 41.1; xent: 3.7; lr: 0.01031; sents: 2059; bsz: 242/ 173/ 7; 491/350 tok/s; 149 sec;
[2023-05-18 08:55:01,678 INFO] Step 20/20000; acc: 88.6; ppl: 34.7; xent: 3.5; lr: 0.01969; sents: 2012; bsz: 228/ 171/ 6; 901/673 tok/s; 230 sec;
Hello, I ran into a problem with the “magic” script. Here is the error:
python magic.py
Traceback (most recent call last):
File "magic.py", line 5, in <module>
import sentencepiece_model_pb2 as model
ModuleNotFoundError: No module named 'sentencepiece_model_pb2'
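This error means the protobuf stub for the SentencePiece model format is not on your path. One workaround (a sketch under the assumption that you have a reasonably recent sentencepiece release, which ships the stub inside the package; the model file name is the one from the NLLB release) is to change the import at the top of magic.py:

```python
# Instead of "import sentencepiece_model_pb2 as model", use the stub
# bundled with recent sentencepiece wheels (assumption: your installed
# version includes it).
from sentencepiece import sentencepiece_model_pb2 as model

# Quick sanity check that the stub can parse the NLLB SentencePiece model
# (adjust the path to your local file).
m = model.ModelProto()
with open("flores200_sacrebleu_tokenizer_spm.model", "rb") as f:
    m.ParseFromString(f.read())
print(len(m.pieces), "pieces in the model")
```

If that import also fails, you can generate the stub locally with `protoc --python_out=. sentencepiece_model.proto`, using the .proto file from the sentencepiece repo; that produces the sentencepiece_model_pb2.py that magic.py expects to find next to it.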
Thanks for the extensive tutorial. I’m getting this error in Colab
“RuntimeError: The expanded size of the tensor (1024) must match the existing size (2048) at non-singleton dimension 0. Target sizes: [1024]. Tensor sizes: [2048]”