Hello everyone
I have successfully trained a Transformer model for English to Indic languages, but I was getting a lot of unknown tokens (<unk>), so I decided to try SentencePiece to mitigate the problem. However, after switching to the SentencePiece tokenizer, I am not getting any output.
I converted the vocabularies to the OpenNMT-tf format following this page: Vocabulary — OpenNMT-tf 2.26.1 documentation
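For reference, this is roughly how I built the models and vocabularies (a sketch of my steps, not my exact shell history; the file names match the config below, and the onmt-build-vocab flags are the ones I took from that page):

# Train one SentencePiece model per language
spm_train --input=train.en --model_prefix=vocab.src --vocab_size=32000
spm_train --input=train.ka --model_prefix=vocab.tgt --vocab_size=32000

# Convert the SentencePiece vocabularies to the OpenNMT-tf format
onmt-build-vocab --from_vocab vocab.src.vocab --from_format sentencepiece --save_vocab vocab.src
onmt-build-vocab --from_vocab vocab.tgt.vocab --from_format sentencepiece --save_vocab vocab.tgt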
My YAML file:
model_dir: run/

data:
  train_features_file: train.en
  train_labels_file: train.ka
  eval_features_file: val.en
  eval_labels_file: val.ka
  source_vocabulary: vocab.src
  target_vocabulary: vocab.tgt
  source_tokenization:
    type: SentencePieceTokenizer
    params:
      model: vocab.src.model
  target_tokenization:
    type: SentencePieceTokenizer
    params:
      model: vocab.tgt.model

train:
  save_checkpoints_steps: 1000
  maximum_features_length: 50
  maximum_labels_length: 50
  batch_size: 4096
  max_step: 500000
  save_summary_steps: 100

eval:
  external_evaluators: BLEU
  export_format: saved_model

params:
  average_loss_in_time: true

infer:
  batch_size: 32
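To check the tokenization side in isolation, I can run the SentencePiece models directly with something like this (spm_encode ships with SentencePiece; the .model files are the ones produced by spm_train above):

spm_encode --model=vocab.src.model < val.en | head -n 3
spm_encode --model=vocab.tgt.model < val.ka | head -n 3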
My output: https://drive.google.com/file/d/1-3oDnE5OKZze8I4AddXr8DzB-YOzhU0t/view?usp=sharing
Any kind of help would be greatly appreciated.