The model is outputting <UNKs>

Hello, thank you so much for such a great open-source codebase for NMT. :slight_smile:

I am trying to train a phoneme-to-grapheme (text) system using OpenNMT-py. The system works well, but it outputs < unk > a lot of the time.

Isn’t it possible to output the closest possible word instead of generating < unk >?

=====================================================================
Example1:
SRC (phonemes separated by space): g a ɪ c t s a ɪ t ɪ c m ʏ s n v i ɐ l e ː b ɛ n s m ɪ t e l ɪ m p ɔ ɾ t ə ʔ i ː n g ɾ ɔ s z e m ʔ ʊ m ʏ ŋ f ɔ n ʃ t a ː ɐ n ʔ a o s ʃ ɛ ɾ a l p ː ɐ ɔ ʊ b ə i ː ə n

TGT (words): gleichzeitig müssen wir lebensmittelimporte in grossem umfang von staaten ausserhalb der eu beziehen

Model’s Output: gleichzeitig müssen wir < unk > in grossem umfang von staaten ausserhalb der eu beziehen

=====================================================================
Example2:
SRC (phonemes separated by space): p ɔ k n e d ɪ ɡ ə h ɛ ʀ ə s p ʀ a ʁ ɡ e ɡ ɔ j a ɡ ɔ h a t a ɪ n ə z ɔ l x ə ɛ ə ʃ a ɪ ə n ʊ ŋ k ɡ e z e ə n

TGT (words): o gnädiger herr sprach diego jago hat eine solche erscheinung gesehn

Model’s Output: < unk > herr < unk > < unk > gegen < unk > < unk > hat eine solche < unk > < unk >

=====================================================================

Dear Aashish,

Try to use BPE to prepare your data before training, using SentencePiece. I have an example here. Make sure you have --model_type=bpe
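
For illustration, here is a minimal sketch (not the MT-Preparation script itself) of training such a BPE model with the SentencePiece Python API; the file name and vocabulary size are assumptions to adapt to your data, and the same call would be repeated for the target side with its own prefix:

```python
import sentencepiece as spm

# Train a BPE subword model on the source training text (one sentence per line).
# Produces source.model and source.vocab.
spm.SentencePieceTrainer.train(
    input="src.txt",        # assumed file name
    model_prefix="source",
    vocab_size=8000,        # assumption; tune to your corpus size
    model_type="bpe",       # the equivalent of --model_type=bpe
)
```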

All the best,
Yasmin

@ymoslem: Thank you for your response. I have a question regarding setting --model_type=bpe

Do we need to set it for both source and target, or just for the source?
In the README, it is only mentioned at: https://github.com/ymoslem/MT-Preparation/blob/main/subwording/1-train.py#L46

Shouldn’t we also set it for the target?

Another question: for prediction (the test set), do I need to run the subword.py script again on src_test?
The current subword script requires both src and tgt, but I only have src for the test set.

Dear Aashish,

Do we need to set it for both source and target, or just for the source?

If you change the model type, it must be for both the source and target. I will clarify this. Thanks!

Another question: for prediction (the test set), do I need to run the subword.py script again on src_test?
The current subword script requires both src and tgt, but I only have src for the test set.

It assumes that you have the reference file. If you do not, just comment out anything related to the target; the script should then work on the source only.
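
For example, here is a minimal sketch of subwording the source side only, assuming a trained source.model and a test file named src.test (the file names are just placeholders):

```python
import sentencepiece as spm

# Load the trained source subword model and segment the test source file.
sp = spm.SentencePieceProcessor(model_file="source.model")

with open("src.test", encoding="utf-8") as fin, \
     open("src.test.subword", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)  # list of subword pieces
        fout.write(" ".join(pieces) + "\n")
```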

Kind regards,
Yasmin

Dear @ymoslem,

Thanks so much for the guidance. It’s really helpful.

I tried the following steps:

1. python filter.py de ch src.txt tgt.txt
2. python 1-train.py src.txt-tokenized.de tgt.txt-tokenized.ch (applied --model_type=bpe for both source and target)
3. python 2-subword.py source.model target.model src.txt-tokenized.de tgt.txt-tokenized.ch
4. python train_dev_split.py 1500 src.txt-tokenized.de.subword tgt.txt-tokenized.ch.subword

Then I built my vocabulary:
5. onmt_build_vocab -config config.yaml -n_sample 10000

Then I trained the system:
6. onmt_train -config config.yaml

Here is the YAML --> https://docs.google.com/document/d/1afdesqN7VZyNu8XCH7znTJVuFk1o2i7-YM43lWM9WLs/edit?usp=sharing

The training finished with the following final log:
Step 10000/10000; acc: 99.92; ppl: 1.02; xent: 0.02; lr: 0.00088; 2027/1890 tok/s; 60142 sec

For testing:
I ran the subword.py script on src.test after commenting out the target part of the script, and then ran onmt_translate.

Results:
Without BPE, with the --replace_unk flag:
SENT 4: ['wird', 'auch', 'künftig', 'davon', 'können', 'profitieren']
PRED 4: wird au künftig devo chönne profitiere --> (90% correct)

With BPE:
SENT 4: ['', '▁wird', '▁auch', '▁künftig', '▁davon', '▁können', '▁profitiere', '']
PRED 4: [ ▁verstehen ▁sie ▁ihn ▁auch ▁immer ]

PRED 4: verstehen sie ihn auch immer --> after running the desubword.py script --> (0% correct)
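
(For reference, a minimal sketch of what such a desubword step does, not the actual desubword.py script; the file names are assumptions:)

```python
# Join the subword pieces of each predicted line and turn the "▁" markers
# back into ordinary spaces.
with open("pred.subword.txt", encoding="utf-8") as fin, \
     open("pred.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        text = "".join(line.split()).replace("\u2581", " ").strip()
        fout.write(text + "\n")
```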

(sorry for these crossed-out lines; I don’t know why they appear)

Problem:
It is surprising to see that the results with BPE are nowhere close to those of the word-based tokenization. Please note that I am working on a German dataset.

Kindly advise if I am missing something.

Regards,
Aashish

Dear Aashish,

What is the size of your dataset? I see 99.92% accuracy after 10,000 steps; this is overfitting.

Regarding segmentation, you can also try the default unigram model (--model_type=unigram), i.e. by removing --model_type=bpe. Also, skip the filtering step (the first script), just in case.

I was wondering whether your "phonemes" model would work better with character-based segmentation (--model_type=char), but this needs testing.
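
A minimal sketch of that variant, reusing the SentencePiece call from above; note that for a char model the vocab_size is an assumption that must not exceed the number of distinct characters (plus special tokens) in your data:

```python
import sentencepiece as spm

# Character-level segmentation: every character becomes its own token.
spm.SentencePieceTrainer.train(
    input="src.txt",            # assumed file name
    model_prefix="source_char",
    vocab_size=100,             # assumption; roughly the number of distinct characters
    model_type="char",          # the equivalent of --model_type=char
)
```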

That said, with a small dataset, it is normal for NMT not to give perfect outputs.

Kind regards,
Yasmin
