Why are words in the training source file still translated as unknown?


(zhum) #1

I am using the default training configuration: 2 layers for the encoder/decoder, 500 hidden nodes, 13 epochs.
The final model output is as follows:

Epoch 13, 30950/30963; acc:  34.54; ppl:  42.56; 3122 src tok/s; 2786 tgt tok/s;  11166 s elapsed
Train perplexity: 43.2077
Train accuracy: 34.3917
Validation perplexity: 144.799
Validation accuracy: 25.7105
Decaying learning rate to 0.00390625

I tried translating the sentences from the training source, and lots of words come out as unknown. Any suggestions? Thank you very much!

(jean.senellart) #2

Hello - the ppl is still quite high, so a high number of unknown words is expected. What is the size of your training corpus?

(zhum) #3

Thanks for your reply. I am using 130,000 sentence pairs now and expect to increase to 2M later.

Is it normal that unknown words are also reported for the training sentences during testing? Thanks.

(Zuzanna Parcheta) #4

I’m having the same problem. I have words that appear in the dictionary, but in testing they are translated as unk. Could it be because the network is not well trained? Should it have more layers, more neurons, and more epochs?

(Guillaume Klein) #5

That is a possibility.

See here:

(zhum) #6

Thanks. I am using the default 2 layers and 512 hidden nodes. I will try 4 layers and 1000, and will report back with an update. Hopefully that solves the problem.

(Jerin Philip) #7

I had this issue before. If you have a huge vocabulary, the default vocabulary size of 50004 is applied during the preprocessing step, which affects the predictions. Because of this truncation, a large fraction of the less frequent vocabulary will be predicted as unknown even on the training data. Please check whether that’s the case.
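To illustrate the mechanism: a minimal sketch of how a frequency-capped vocabulary behaves. The helper names here are illustrative, not OpenNMT's actual API; the point is that any training word outside the top-N by frequency maps to the unknown token, even though it appeared in the corpus.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<s>", "</s>"]

def build_vocab(tokens, max_size):
    """Keep only the max_size most frequent tokens, plus special symbols.

    Everything else falls out of the vocabulary and will be looked up
    as <unk> - exactly what happens when the preprocessing vocabulary
    cap is smaller than the corpus vocabulary.
    """
    counts = Counter(tokens)
    kept = [word for word, _ in counts.most_common(max_size)]
    return {word: idx for idx, word in enumerate(SPECIALS + kept)}

def lookup(word, vocab):
    # Out-of-vocabulary words map to the <unk> index.
    return vocab.get(word, vocab["<unk>"])

corpus = "a a a b b c".split()
vocab = build_vocab(corpus, max_size=2)  # cap below the true vocab size of 3

print(lookup("a", vocab) == vocab["<unk>"])  # False: "a" survived the cap
print(lookup("c", vocab) == vocab["<unk>"])  # True: "c" is in the training
                                             # data but was truncated away
```

Raising the vocabulary-size option at preprocessing time (or switching to subword units) is the usual way around this, rather than adding layers or epochs.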