Getting no output when using SentencePiece

aquorio15 · April 20, 2022, 4:08am

Hello everyone

I have successfully trained a Transformer model for English to Indic languages, but I was getting a lot of unknown tokens (<unk>), so I decided to try SentencePiece to mitigate the problem. However, after switching to the SentencePiece tokenizer, I am not getting any output.

I converted the vocabularies to the OpenNMT-tf format by following the Vocabulary page of the OpenNMT-tf 2.26.1 documentation.

My YAML file:

model_dir: run/

data:
  train_features_file: train.en
  train_labels_file: train.ka
  eval_features_file: val.en
  eval_labels_file: val.ka
  source_vocabulary: vocab.src
  target_vocabulary: vocab.tgt
  source_tokenization:
    type: SentencePieceTokenizer
    params:
      model: vocab.src.model
  target_tokenization:
    type: SentencePieceTokenizer
    params:
      model: vocab.src.model

train:
  save_checkpoints_steps: 1000
  maximum_features_length: 50
  maximum_labels_length: 50
  batch_size: 4096
  max_step: 500000
  save_summary_steps: 100

eval:
  external_evaluators: BLEU
  export_format: saved_model

params:
  average_loss_in_time: true

infer:
  batch_size: 32

My output: https://drive.google.com/file/d/1-3oDnE5OKZze8I4AddXr8DzB-YOzhU0t/view?usp=sharing

Any kind of help would be greatly appreciated.

ymoslem · April 23, 2022, 4:12am

Dear Amartya,

Your dev/test datasets are too big. For low-resource languages, randomly select 1000 or 2000 segments each for the dev and test datasets, and add the rest to your training data.
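
One way to do such a split is sketched below. It is only an illustration: the function names, the default sizes, and the fixed seed are assumptions, not something taken from the original setup. It shuffles the aligned pairs once, holds out the first segments for dev and test, and leaves everything else for training.

```python
import random


def split_parallel(src_lines, tgt_lines, dev_size=1000, test_size=1000, seed=1234):
    """Randomly hold out dev and test pairs; everything else stays in train."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if dev_size + test_size >= len(src_lines):
        raise ValueError("corpus is too small for the requested dev/test sizes")

    # Shuffle source and target together so the pairs stay aligned.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split reproducible

    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test


def write_pairs(pairs, src_path, tgt_path):
    """Write aligned pairs back to two line-parallel files."""
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for s, t in pairs:
            src.write(s + "\n")
            tgt.write(t + "\n")
```

The lists passed in would be the lines of the full source and target files with trailing newlines removed; write_pairs can then produce the new train, dev, and test files.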

This makes me ask, what is the current size of your training dataset?

What datasets are you using for Kannada? If you are using crawled datasets, make sure to filter them, as they can have many issues, including segments in the wrong language.
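
As a rough illustration of one possible filter, the sketch below keeps a pair only when the target side is mostly written in Kannada script and the source side is not. It is a script-based heuristic, not real language identification, and the function names and thresholds are assumptions chosen for the example.

```python
def kannada_ratio(text):
    """Fraction of alphabetic characters that fall in the Kannada Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # The Kannada block spans U+0C80 to U+0CFF.
    kannada = sum(1 for ch in letters if "\u0c80" <= ch <= "\u0cff")
    return kannada / len(letters)


def filter_pairs(pairs, min_target_ratio=0.5, max_source_ratio=0.1):
    """Keep (source, target) pairs whose scripts look plausible for English to Kannada."""
    kept = []
    for src, tgt in pairs:
        target_ok = kannada_ratio(tgt) >= min_target_ratio
        source_ok = kannada_ratio(src) <= max_source_ratio
        if target_ok and source_ok:
            kept.append((src, tgt))
    return kept
```

A check like this only catches segments in the wrong script. Misaligned or untranslated segments written in the right script would need other filters, such as length-ratio checks.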

Training for only 5000 steps might not be enough. What if you train for longer? (I am referring to the prediction file you shared, whose name ends with 5000.)

If you suspect there is an issue with the on-the-fly tokenization, try applying your SentencePiece model to your training and dev data first, and remove the on-the-fly tokenization options from the configuration.
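
A minimal sketch of that offline step, assuming the sentencepiece Python package is installed. The helper names are made up for this example. Each output line holds the space-separated pieces of one input line.

```python
def load_encoder(model_path):
    """Return a function that maps a sentence to a list of SentencePiece pieces."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    processor = spm.SentencePieceProcessor(model_file=model_path)
    return lambda line: processor.encode(line, out_type=str)


def tokenize_file(encode, in_path, out_path):
    """Write one space-joined line of pieces per input line; return the line count."""
    count = 0
    with open(in_path, encoding="utf-8") as fin, \
            open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            pieces = encode(line.rstrip("\n"))
            fout.write(" ".join(pieces) + "\n")
            count += 1
    return count
```

For example, tokenize_file(load_encoder("vocab.src.model"), "train.en", "train.en.sp") would write a pre-tokenized copy of the source training file (the .sp output name is hypothetical). The pre-tokenized files would then replace the raw ones in the data section of the configuration. Opening a few output lines and checking that their pieces appear in the vocabulary file is a quick way to see whether the SentencePiece model and the vocabulary actually match.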

After doing everything correctly, and having some output, you will still have to apply other approaches to improve the system, such as Tagged Back-Translation explained in other posts on this forum.

Kind regards,
Yasmin