Why are words from the training source file still translated as unknown?

zhum · November 26, 2017, 2:10am

I am using the default training configuration: 2 layers for the encoder and decoder, 500 hidden nodes, 13 epochs.
The final epoch reports the following:

Epoch 13, 30950/30963; acc:  34.54; ppl:  42.56; 3122 src tok/s; 2786 tgt tok/s;  11166 s elapsed
Train perplexity: 43.2077
Train accuracy: 34.3917
Validation perplexity: 144.799
Validation accuracy: 25.7105
Decaying learning rate to 0.00390625

I tried to translate sentences taken from the training source file, and many words come out as unknown. Any suggestions? Thank you very much!

jean.senellart · November 27, 2017, 6:05pm

Hello - the perplexity (ppl) is still quite high, so a large number of unknown words is expected. What is the size of your training corpus?

zhum · November 30, 2017, 3:37pm

Thanks for your reply. I am using 130,000 sentence pairs now and expect to increase to 2 million later.

Is it normal that unknown words are also reported for training sentences at translation time? Thanks.

Sasanita · December 1, 2017, 3:05pm

I’m having the same problem. Some words appear in the dictionary, but at test time they are translated as unk. Could it be because the network is not well trained? Does it need more layers, more neurons, and more epochs?

guillaumekln · December 1, 2017, 4:44pm

That is a possibility.

See here:

zhum · December 4, 2017, 5:03pm

Thanks. I am using the default 2 layers and 512 hidden nodes. I will try 4 layers and 1000 hidden nodes and report back later. Hopefully that solves the problem.

jerin · January 6, 2018, 11:55am

Hello,

I had this issue before. If your corpus has a huge vocabulary, the preprocessing step still keeps only the default vocabulary size of 50004 entries, which affects the predictions. Because of this, a large fraction of the less frequent words will be predicted as unknown, even on the training data. Please check whether that’s the case.
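
One rough way to check this is sketched below in Python. It assumes the training file is whitespace-tokenized and that the vocabulary simply keeps the most frequent words up to the size limit. The function name `unk_rate` and the toy corpus are made up for illustration and are not part of OpenNMT.

```python
from collections import Counter


def unk_rate(sentences, vocab_size):
    """Fraction of tokens that fall outside the `vocab_size` most frequent words.

    Assumption: tokens are separated by whitespace and the vocabulary is
    built by frequency, which is only an approximation of what a real
    preprocessing step does.
    """
    counts = Counter(tok for line in sentences for tok in line.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Tokens covered by the words that would be kept in the vocabulary.
    kept = sum(count for _, count in counts.most_common(vocab_size))
    return 1.0 - kept / total


# Toy corpus standing in for the lines of a training source file.
corpus = ["the cat sat", "the dog sat", "the bird flew"]
print(unk_rate(corpus, 2))  # only "the" and "sat" are kept
print(unk_rate(corpus, 6))  # every word type is kept
```

To try it on real data, pass the lines of your own training source file and the vocabulary size used at preprocessing. If the result is a large share of the tokens, the size limit is a likely cause of the unknowns; if it is close to zero, the cause is more likely undertraining, as suggested earlier in the thread.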