I’m working on building a chatbot using conversational data. I have a domain-specific corpus of 40K utterances. I tried training on this alone, but the results were poor (as expected). I would like to try pretraining on a much larger general corpus first, then continuing training on the domain-specific data.
I have some questions about the process:
Would it be beneficial to generate the vocab file using both the parent dataset and the domain-specific dataset? I am using SentencePiece.
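For what it's worth, SentencePiece's trainer accepts a comma-separated list of input files, so a single shared vocabulary over both corpora is easy to build. A minimal sketch (file names here are hypothetical placeholders; the toy data just makes it runnable):

```python
# Sketch: prepare inputs so one SentencePiece vocabulary covers both
# the parent corpus and the domain-specific utterances.
# File names are hypothetical placeholders for your real corpora.

parent_corpus = "parent.txt"   # large general-domain corpus
domain_corpus = "domain.txt"   # stand-in for the 40K domain utterances

# Tiny stand-in data so the sketch runs end to end.
with open(parent_corpus, "w") as f:
    f.write("how are you today\nwhat is the weather like\n")
with open(domain_corpus, "w") as f:
    f.write("reset my account password\nescalate this support ticket\n")

# SentencePiece takes a comma-separated file list, so no manual
# concatenation is needed (vocab_size is illustrative only):
#
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(
#       input="parent.txt,domain.txt",
#       model_prefix="shared",   # writes shared.model / shared.vocab
#       vocab_size=32000,        # pick to suit the combined corpora
#   )
#
# Using the same shared.model for both training stages keeps token
# ids consistent between pretraining and fine-tuning.

# Sanity check: count the utterances that would feed the trainer.
total = sum(len(open(p).readlines()) for p in (parent_corpus, domain_corpus))
print(total)  # 4 lines across the two toy files
```

The key point is that the tokenizer must be fixed before the parent training starts; otherwise domain-specific tokens won't exist in the fine-tuning stage.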
Do I train the parent model until convergence, then continue training on the domain-specific data until that converges?
If anyone has done this before, do you have any advice?