Hi everyone, I'm using SentencePiece unigram subword tokenization. I'm getting this error when I launch the training of my bilingual model:
Node: 'transformer_base_1/self_attention_decoder_1/dense_192/Reshape_1' Input to reshape is a tensor with 528066 values, but the requested shape has 352022 [[{{node transformer_base_1/self_attention_decoder_1/dense_192/Reshape_1}}]] [Op:__inference__accumulate_gradients_31667]
And here are my commands:
(tensorflow-env) ubuntu@ip-172-31-2-199:~/TY-EN$

# Train a tokenizer model for the source text of the training data
spm_train --input=src-train.txt --model_prefix=src_spm --vocab_size=2000 --model_type=unigram
ls -l src_spm.model src_spm.vocab   # Check if model and vocab files are generated

# Train a tokenizer model for the target text of the training data
spm_train --input=tgt-train.txt --model_prefix=tgt_spm --vocab_size=2000 --model_type=unigram
ls -l tgt_spm.model tgt_spm.vocab   # Check if model and vocab files are generated

# Tokenize the source text files according to the source tokenizer model
spm_encode --model=src_spm.model --output_format=piece < src-train.txt > src-train.sp
spm_encode --model=src_spm.model --output_format=piece < src-val.txt > src-val.sp
spm_encode --model=src_spm.model --output_format=piece < src-test.txt > src-test.sp

# Tokenize the target text files according to the target tokenizer model
spm_encode --model=tgt_spm.model --output_format=piece < tgt-train.txt > tgt-train.sp
spm_encode --model=tgt_spm.model --output_format=piece < tgt-val.txt > tgt-val.sp
spm_encode --model=tgt_spm.model --output_format=piece < tgt-test.txt > tgt-test.sp

# Train the bilingual model
onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval
# After starting the training, check TensorBoard for visual insights.
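For context, data.yml follows the OpenNMT-tf quickstart layout and points at the tokenized files and the SentencePiece vocabularies above. Roughly this (a simplified sketch, not a verbatim copy of my file; run/ is just the directory name I use):

model_dir: run/

data:
  train_features_file: src-train.sp
  train_labels_file: tgt-train.sp
  eval_features_file: src-val.sp
  eval_labels_file: tgt-val.sp
  source_vocabulary: src_spm.vocab
  target_vocabulary: tgt_spm.vocab

One thing I'm not sure about: whether the .vocab files written by spm_train can be used here directly, or whether they first need to be converted to OpenNMT's vocabulary format with onmt-build-vocab.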
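In case it helps with diagnosis, these are quick sanity checks I can run and post the output of (just my own idea, not from the docs): the tokenized source and target files should still line up line for line, and each .vocab file should have the 2000 entries requested above.

# Source/target files should stay parallel (same number of lines)
wc -l src-train.sp tgt-train.sp
wc -l src-val.sp tgt-val.sp

# Each SentencePiece vocab should have the 2000 entries requested above
wc -l src_spm.vocab tgt_spm.vocab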
I don't know what I'm doing wrong. Any help is welcome.
(Environment: EC2 g4dn.xlarge with a Tesla T4, 50 GiB, TensorFlow 2.10, Python 3.10.12, CUDA 12.3, cuDNN 8.5, OpenNMT-tf 2.32.0.)