Problem with training and validation accuracy

AhmedRamadan · November 10, 2019, 2:55pm

Hello there,
I preprocessed my data with this command:

!onmt_preprocess -train_src data/train.src.txt -train_tgt data/train.tgt.txt -valid_src data/valid.src.txt -valid_tgt data/valid.tgt.txt -src_vocab_size 400191 -tgt_vocab_size 62423 -src_words_min_frequency 2 -tgt_words_min_frequency 2 -src_seq_length 200 -tgt_seq_length 30 -save_data data/data -overwrite

After that, I trained the model with this command:

!onmt_train -batch_size 128 -accum_count 3 -report_every 100 -world_size 1 -gpu_ranks 0 -layers 2 -train_steps 37110 -save_checkpoint_steps 5000 -valid_batch_size 64 -valid_steps 500 -rnn_size 512 -data data/data -pre_word_vecs_enc "data/embeddings.enc.pt" -pre_word_vecs_dec "data/embeddings.dec.pt" -src_word_vec_size 224 -tgt_word_vec_size 336 -fix_word_vecs_enc -fix_word_vecs_dec -save_model data/my_model -encoder_type rnn -decoder_type rnn -rnn_type GRU -tensorboard -optim adam -adam_beta2 0.998 -learning_rate 0.001

During the first 1,000 steps, training accuracy quickly increased from 16.0 to 33; after that, it slowly increased up to 48, while validation accuracy slowly decreased to 32.2196.
The model's result is: BLEU = 10.53, 41.0/21.8/15.7/13.5 (BP=0.505, ratio=0.594, hyp_len=188961, ref_len=318165)

How can I solve this problem and prevent this overfitting, and how should I choose a loss function?
Thanks
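
For readers following along, the BLEU line above can be taken apart with a few lines of arithmetic. This is a minimal sketch that assumes the standard corpus-level BLEU definition (brevity penalty times the geometric mean of the four n-gram precisions); all numbers are copied from the post, and the variable names are only illustrative.

```python
import math

# Figures copied from the BLEU line in the post above.
precisions = [41.0, 21.8, 15.7, 13.5]  # 1- to 4-gram precisions, in percent
hyp_len = 188961  # total tokens in the model output
ref_len = 318165  # total tokens in the reference

# Brevity penalty: 1 if the output is at least as long as the reference,
# otherwise exp(1 - ref_len / hyp_len).
ratio = hyp_len / ref_len
brevity_penalty = 1.0 if ratio >= 1.0 else math.exp(1.0 - 1.0 / ratio)

# Geometric mean of the four n-gram precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

bleu = brevity_penalty * geo_mean

print(f"ratio={ratio:.3f}  BP={brevity_penalty:.3f}")
print(f"geometric mean={geo_mean:.2f}  BLEU={bleu:.2f}")
```

Up to rounding, this reproduces the reported ratio, BP, and BLEU. It also shows that the brevity penalty roughly halves the score: the geometric mean of the precisions is close to 21, but the output is only about 59% as long as the reference.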

ymoslem · January 18, 2020, 10:19am

Dear Ahmed,

Could you please give more details, such as the size of your training dataset, and whether your training and validation datasets come from the same original dataset or are very different? What about your pre-trained embeddings?

If the final training score is 48, you (also) have “under-fitting” of the training dataset.

To fix “under-fitting”, things that can help include (ordered by importance) training a bigger network (more layers and/or more units), training longer (more steps/epochs), using different optimization parameters (optimizer, learning rate) and/or using a different architecture (RNN [and type, e.g. LSTM], or Transformer).
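
As a rough illustration of why a training score of 48 points to under-fitting as well as over-fitting, here is a minimal sketch of the usual two comparisons: training accuracy against a target, and validation accuracy against training accuracy. The function name `diagnose` and the thresholds `target_acc` and `max_gap` are hypothetical values chosen for this example, not OpenNMT settings or recommendations.

```python
def diagnose(train_acc, valid_acc, target_acc=90.0, max_gap=5.0):
    """Label a run from its accuracies (all in percent).

    target_acc and max_gap are hypothetical thresholds chosen for this
    illustration; they are not OpenNMT defaults or recommendations.
    """
    issues = []
    if train_acc < target_acc:
        # The model does not even fit the data it was trained on.
        issues.append("under-fitting")
    if train_acc - valid_acc > max_gap:
        # The model does much better on training data than on held-out data.
        issues.append("over-fitting")
    return issues or ["ok"]


# Accuracies reported in the first post: 48 on training, 32.2196 on validation.
print(diagnose(train_acc=48.0, valid_acc=32.2196))
```

With the figures from the first post, both labels apply, so it makes sense to address the under-fitting first and then look at the gap to validation.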

These two playlists by Prof. Andrew Ng are useful:
1- https://www.youtube.com/playlist?list=PLkDaE6sCZn6Hn0vK8co82zjQtt3T2Nkqc
2- https://www.youtube.com/playlist?list=PLkDaE6sCZn6E7jZ9sN_xHwSHOdjUxUW_b

Kind regards,
Yasmin

AhmedRamadan · January 24, 2020, 6:30pm

Good. Can I contact you, please?

ymoslem · January 24, 2020, 8:09pm

I believe it is better to keep the discussion here, so that you can receive input from other colleagues, and so that others can benefit from the post and answers in the future.

AhmedRamadan · January 28, 2020, 3:39pm

OK, thanks a lot.
I made my own embeddings using the gensim library.
My sequences are at most 300 tokens long.
I have 300K samples. How can I choose my parameters to get better word2vec embeddings?
My model is:
ast_model = Word2Vec(My_data(ast_node_after_spliting), size=32, window=10, min_count=1,iter=50)

ymoslem · January 29, 2020, 7:28am

Dear Ahmed,

So you created embeddings using 300K sentences and maybe you then trained an NMT system on the same number of sentences, right? If so, this is not how things work.

OpenNMT already learns its own embeddings as part of training, which means the only reason you need pre-trained embeddings is when you have a small dataset for NMT training AND embeddings pre-trained on a much larger corpus. In this case, the large pre-trained embeddings can improve the quality of translation.

I am quoting this article “Pre-trained Word Embeddings or Embedding Layer”:

We see from the results of the TREC Question Classification task that vectors trained on a small corpus will have a worse performance than an embedding layer. However, vectors trained on a large corpus beat the embedding layer by a considerable margin in terms of both precision and recall.

So let’s try to solve these issues one by one:

1- For training an NMT model, 300K samples is not big enough (unless you are training an in-domain model). To get more data, you can download bilingual datasets from OPUS.
2- If you really need to use a small dataset for NMT, then you can use big pre-trained embeddings already available on the internet for different languages. For OpenNMT-py, here is the way to convert embeddings if needed.
3- For your training parameters, is there any reason you are using GRU, for example? Try to start with the defaults, which include LSTM as the RNN type (in theory GRU is more efficient, but LSTM remembers longer sentences); later, maybe you can try the Transformer model.
4- For the under-fitting issue, the previous points can help, but also consider a deeper network.
5- For the overfitting issue, if you manage to increase your training accuracy, you will probably be able to stop before overfitting. Again, make sure that your training dataset and validation dataset are both from the same original dataset (different segments but the same distribution).
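
To make the last point concrete, one simple way to get training and validation sets with the same distribution is to shuffle the whole parallel corpus once and then cut it into segments. The sketch below only illustrates that idea; the function name `split_parallel`, the sizes, and the toy data are assumptions made for the example, not part of OpenNMT.

```python
import random


def split_parallel(src_lines, tgt_lines, valid_size, test_size, seed=0):
    """Shuffle aligned source/target pairs once, then cut three segments."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    if valid_size + test_size >= len(pairs):
        raise ValueError("held-out sets leave no training data")
    random.Random(seed).shuffle(pairs)  # seeded so the split is reproducible
    valid = pairs[:valid_size]
    test = pairs[valid_size:valid_size + test_size]
    train = pairs[valid_size + test_size:]
    return train, valid, test


# Toy data standing in for the real corpus.
src = [f"source {i}" for i in range(10)]
tgt = [f"target {i}" for i in range(10)]
train, valid, test = split_parallel(src, tgt, valid_size=2, test_size=2)
print(len(train), len(valid), len(test))
```

Shuffling the pairs together keeps each source line aligned with its target, and because all three segments come from the same shuffled pool, none of them is drawn from a different source than the others.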

I hope this helps.

Kind regards,
Yasmin

AhmedRamadan · January 29, 2020, 9:35am

OK, I prepared my own embeddings with the gensim library to use in my OpenNMT model.
I can increase my data to 1 million samples (train, test, and validation combined), so is this enough data to create the embeddings?
Edit:
My data is a programming-language dataset, not natural language.
My research is on source code summarization.
Thanks

ymoslem · January 29, 2020, 9:58am

Dear Ahmed,

Increasing your dataset is a good idea in general; you can try and see if you get quality improvement.

Code summarization is a very interesting topic, but I do not have experience with it. However, I see several research papers when I search for “code summarization”, including this one that uses OpenNMT.

All the best!
Yasmin