Training a Transformer on a small dataset of source code sequences

Hello,
I am trying to train a transformer model on a very small dataset (18K). I tried different configurations of the model with regard to the hidden dimension, layers, heads and gradient clipping value, but it is performing very bad.
The task I am trying to accomplish is to teach the model to perform different actions on source code (refactoring, debugging etc). The source sequence is the source code before the change and the target sequence is the source code after the change is made. The sequences have a max length of 100 tokens.
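To make the setup concrete, a pair in my dataset looks roughly like the sketch below (the concrete code, action label, and field names are just an illustration, not my actual data):

```python
# Purely illustrative (source, target) pair for a "rename variable" refactoring.
example_pair = {
    "action": "refactoring",
    "source": "def area(l, w):\n    return l * w",
    "target": "def area(length, width):\n    return length * width",
}
```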
I have thought of using a pretrained model and fine-tuning it for my task, but since most pretrained models are trained on natural language, I am not sure whether that would work. Do you know of any model trained on source code, and do you think it would be worth fine-tuning a model trained on natural language?
Any kind of help would be greatly appreciated.

Hey @dukaenea, welcome to our community!
Do you have access to a bigger dataset of source code that you could use to add a "copy" action, i.e. pairs where the source and target sequence are identical? You don't actually need the copy action itself, but the transfer learning from it should support the other actions.
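Something like this rough sketch, where `code_snippets` is extra unlabeled code you have lying around and the field names are just placeholders:

```python
import random

def add_copy_pairs(edit_pairs, code_snippets):
    """Augment (before, after) edit pairs with identity "copy" pairs."""
    augmented = list(edit_pairs)
    for snippet in code_snippets:
        # source == target: the model only has to learn to reproduce its input
        augmented.append({"source": snippet, "target": snippet})
    random.shuffle(augmented)
    return augmented
```

You could also pre-train on the copy pairs first and then fine-tune on your 18K edit pairs.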
Also have a look at CodeBERT. It is a transformer encoder that you can use to fill masked tokens or to initialize a seq2seq model.
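Here is a minimal sketch with the transformers library, assuming the microsoft/codebert-base checkpoint from the Hub (untested, just to show the idea of warm-starting a seq2seq model from CodeBERT):

```python
from transformers import EncoderDecoderModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# CodeBERT initializes both the encoder and the decoder; the decoder's
# cross-attention weights are randomly initialized and learned during fine-tuning.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "microsoft/codebert-base"
)

# Special tokens the seq2seq model needs for generation
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```

From there you can fine-tune it on your (before, after) pairs like any other seq2seq model.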
The generation of source code is still very experimental.

Greetings from the translation space