Over the first 1000 steps, training accuracy quickly increased from 16.0 to 33; after that it slowly increased up to 48, while validation accuracy slowly decreased and ended at 32.2196.
The model result is: BLEU = 10.53, 41.0/21.8/15.7/13.5 (BP=0.505, ratio=0.594, hyp_len=188961, ref_len=318165)
How can I solve this problem and prevent overfitting, and how should I choose a loss function?
thanks
Could you please give more details, such as the size of your training dataset, and whether your training and validation datasets come from the same original dataset or are very different? Also, what about your pre-trained embeddings?
If the final training score is 48, you (also) have “under-fitting” of the training dataset.
To fix “under-fitting”, things that can help include (ordered by importance) training a bigger network (more layers and/or more units), training longer (more steps/epochs), using different optimization parameters (optimizer, learning rate) and/or using a different architecture (RNN [and type, e.g. LSTM], or Transformer).
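To make these knobs concrete, here is a minimal PyTorch sketch (not your actual model; the input size, hidden sizes, layer counts and learning rate are purely illustrative) of what “a bigger network” and “different optimization parameters” look like in code:

import torch
import torch.nn as nn

# Illustrative only: a small encoder vs. a "bigger" one (more layers, more units).
small_encoder = nn.LSTM(input_size=300, hidden_size=256, num_layers=1, batch_first=True)
bigger_encoder = nn.LSTM(input_size=300, hidden_size=512, num_layers=2, dropout=0.3, batch_first=True)

# "Different optimization parameters": e.g. switch the optimizer and/or the learning rate.
optimizer = torch.optim.Adam(bigger_encoder.parameters(), lr=1e-3)

“Training longer” is then simply running more training steps/epochs with the same setup.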
I believe it is better to be here, for you to receive input from other colleagues, and for others who might benefit from the post and answers in the future.
OK, thanks a lot.
I made my own embeddings using the gensim library.
My sequences are at most 300 tokens long.
I have 300K samples. How should I choose my parameters to get better word2vec embeddings?
My model is:

from gensim.models import Word2Vec
ast_model = Word2Vec(My_data(ast_node_after_spliting), size=32, window=10, min_count=1, iter=50)
So you created embeddings using 300K sentences and maybe you then trained an NMT system on the same number of sentences, right? If so, this is not how things work.
OpenNMT already learns its own embeddings during training, which means the only reason you need pre-trained embeddings is when you have a small dataset for NMT training AND big pre-trained embeddings. In this case, the big pre-trained embeddings can improve the quality of translation.
We see from the results of the TREC Question Classification task that vectors trained on a small corpus will have a worse performance than an embedding layer. However, vectors trained on a large corpus beat the embedding layer by a considerable margin in terms of both precision and recall.
So let’s try to solve these issues one by one:
1. For training an NMT model, 300K segments is not big enough (unless you are training an in-domain model). To get more data, you can download bilingual datasets from OPUS.
2. If you really need to use a small dataset for NMT, then you can use big pre-trained embeddings already available on the internet for different languages. For OpenNMT-py, here is the way to convert embeddings if needed; a rough sketch of the idea also follows these points.
3. For your training parameters, is there any reason you are using GRU, for example? Try starting with the defaults, which include LSTM as the RNN type (in theory GRU is more efficient, but LSTM remembers longer sentences); later you can also try the Transformer model.
4. For the under-fitting issue, the previous points can help, but also consider a deeper network.
5. For the overfitting issue, if you manage to increase your training accuracy, you will probably be able to stop training before it overfits. Again, make sure that your training and validation datasets come from the same original dataset (different segments but the same distribution); a minimal split sketch also follows below.
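Regarding point 2, the general idea of converting your gensim vectors into a tensor of pre-trained embeddings looks roughly like this (a minimal sketch, not OpenNMT-py’s own conversion script; the file names and the way the vocabulary is read are assumptions you would need to adapt):

import numpy as np
import torch
from gensim.models import Word2Vec

# Assumed paths: the gensim model trained earlier and a plain-text vocabulary,
# one token per line, in the same order as the NMT vocabulary indices.
w2v = Word2Vec.load("ast_model.w2v").wv
vocab = [line.strip() for line in open("src_vocab.txt", encoding="utf-8")]

dim = w2v.vector_size
matrix = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype("float32")
for i, token in enumerate(vocab):
    if token in w2v:                 # copy the pre-trained vector when it exists
        matrix[i] = w2v[token]

# Save as a tensor that can then be used as pre-trained embeddings.
torch.save(torch.from_numpy(matrix), "src_embeddings.pt")

The conversion instructions referenced in point 2 remain the authoritative way to do this for OpenNMT-py; the sketch only shows what such a conversion does.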
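And for point 5, keeping the training and validation data in the same distribution can be as simple as shuffling one parallel corpus and carving the validation set out of it (hypothetical file names and an illustrative split size):

import random

# Hypothetical parallel corpus: one source line per target line.
src_lines = open("all.src", encoding="utf-8").readlines()
tgt_lines = open("all.tgt", encoding="utf-8").readlines()
pairs = list(zip(src_lines, tgt_lines))

random.seed(13)
random.shuffle(pairs)                        # same distribution for train and valid
valid, train = pairs[:2000], pairs[2000:]    # split size is illustrative

for name, subset in (("train", train), ("valid", valid)):
    with open(name + ".src", "w", encoding="utf-8") as fs, open(name + ".tgt", "w", encoding="utf-8") as ft:
        fs.writelines(s for s, _ in subset)
        ft.writelines(t for _, t in subset)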
OK, I prepared my own embeddings with the gensim library to use them in my OpenNMT model.
I can increase my data to 1 million segments (train/test/valid). Is this enough data to create the embeddings?
Edit:
My data is a programming dataset, not natural language.
I am doing my research on code summarization.
thanks
Increasing your dataset is a good idea in general; you can try it and see whether you get a quality improvement.
This is very interesting, but I do not have experience with this topic. However, I see several research papers when I search for “code summarization”, including this one that uses OpenNMT.