Can a word-level corpus decrease the high frequency of the 'unknown' tag?

I trained a model with the options below.

- 3M sentences for each of the source and target languages.
- layers: 8, rnn_size: 1000, vocab_size: 200000
- pretrained word embeddings were loaded with pre_word_vecs_enc and pre_word_vecs_dec.
- a target-language corpus was also used in training.

But even when I translate data the model was trained on, too many words still come out as ‘unknown’.
How can I improve the performance?

Could training with a word-level corpus (not sentence-level) be helpful?

Hi!
Are your embeddings trained on the training data vocabulary?
Did you extend your embeddings with the embeddings.lua script according to the training data?
Do your test data share domain/vocabulary with your training data?
Do you use BPE segmentation?

It looks like you have a representation problem: as I see it, you are using enough data and a large vocabulary, but you are not getting good coverage.
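A quick way to check this is to measure how many test-set tokens fall outside the training vocabulary. Here is a rough Python sketch, assuming plain tokenized text files; the file names (train.src, test.src) and the 200k cutoff are just placeholders matching your setup, not anything prescribed by OpenNMT.

```python
# Rough OOV-coverage check (illustrative; file names and vocab format are assumptions).
# Counts how many test-set tokens fall outside the training vocabulary, i.e. how
# many a word-level model will have to emit as <unk>.
from collections import Counter

def load_vocab(path, max_size=200000):
    """Build a vocabulary from the max_size most frequent training tokens."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return {w for w, _ in counts.most_common(max_size)}

def oov_rate(test_path, vocab):
    """Fraction of test tokens not covered by the vocabulary."""
    total, oov = 0, 0
    with open(test_path, encoding="utf-8") as f:
        for line in f:
            for tok in line.split():
                total += 1
                oov += tok not in vocab
    return oov / max(total, 1)

vocab = load_vocab("train.src")                   # hypothetical training file
print("OOV rate:", oov_rate("test.src", vocab))   # hypothetical test file
```

If the OOV rate is high even on in-domain data, a bigger vocabulary alone will not fix it.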

I suggest you try BPE segmentation. It will help the system handle less frequent and unknown words.
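To make the idea concrete, here is a minimal, illustrative sketch of how BPE learns its merge operations from word frequencies, in the spirit of Sennrich et al.; the toy vocabulary below is just an assumption for demonstration, and for real training you would use an existing implementation such as subword-nmt or the OpenNMT tokenizer.

```python
# Toy illustration of how BPE learns merges (not a production implementation).
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs over a {word-as-symbol-string: frequency} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words are represented as space-separated characters plus an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                      # the number of merges is a hyperparameter
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)     # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```

Because rare and unseen words are split into these learned subword units, the model can still represent them instead of replacing them with the unknown tag.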

Also, you can add word-level aligned pairs, or even segment/phrase-aligned pairs, to your training data, but be careful to include enough examples so the system sees these particular cases often enough to learn them. Keep in mind that this mostly teaches the system how to deal with parts of a sentence, or even single words, rather than giving it broader or better vocabulary coverage.

Good luck!

Embeddings were trained on the training data vocabulary.
I used embeddings.lua to map pretrained word2vec vectors to the built vocabulary.
But I hadn’t considered BPE segmentation. Thanks for your help.

Anyway, this is what is written about embeddings.lua:
"When training with small amounts of data, performance can be improved by starting with pretrained embeddings."

With 3M sentence pairs, can I still expect performance to improve?

Indeed, if you use pretrained embeddings you can expect to improve your translation quality, especially when you have a small amount of data that does not let the system properly learn the semantic distribution the embeddings need to capture.

Pretrained embeddings give the system a “good starting point”, but after seeing 3M sentence pairs I think the embeddings will be fully adapted to the training data’s semantic space, so in this case starting from pretrained embeddings may not help much more than starting from a random initialization.
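For intuition, this is roughly what mapping pretrained vectors onto the model vocabulary amounts to (a conceptual Python/NumPy sketch, not the actual embeddings.lua script; all names are illustrative): words found in the word2vec file get their pretrained vector, the rest get a random initialization, and training then updates all of them.

```python
# Conceptual view of building an embedding matrix from pretrained vectors
# (illustrative only; the real mapping is done by OpenNMT's embeddings.lua).
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=1234):
    """vocab: list of words; pretrained: dict word -> vector of length dim."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim))  # random start for uncovered words
    hits = 0
    for i, word in enumerate(vocab):
        if word in pretrained:
            matrix[i] = pretrained[word]   # copy the pretrained vector
            hits += 1
    print(f"initialized {hits}/{len(vocab)} rows from pretrained vectors")
    return matrix

# With enough parallel data, training moves these rows away from their word2vec
# starting point anyway, which is why the gain shrinks for large corpora.
```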

I will definitely give BPE or other segmentation a try :slight_smile:
