Advice for creating training and testing sets


I’m working with all of my conversation data from FB Messenger. What process do you use when creating training and test sets? What preprocessing steps do you need to take before generating the required .txt files for OpenNMT?

I’m also new to OpenNMT and this forum. Does anyone have any other gotchas or tips regarding preparing a conversational dataset for use with OpenNMT?


This post seems relevant:

You can search for SentencePiece or BPE, which are subword tokenization methods commonly used when training NN models.
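To make the idea concrete, here is a minimal toy sketch of the BPE merge-learning step: it repeatedly finds the most frequent adjacent symbol pair in the corpus and merges it into a new token. This is only an illustration of the algorithm; for real data you would use the actual SentencePiece or subword-nmt tools rather than this code.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of whitespace-tokenized words."""
    # Represent each word as a tuple of characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

For example, on a corpus of five `low` and two `lower`, the first learned merge is `("l", "o")`, since that pair appears in every word.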

In the OpenNMT docs, it says we only need to provide training and validation files (src-train.txt, tgt-train.txt, src-val.txt, tgt-val.txt). When/where do we provide the test set?

The test set is not required for training, but it is used later on to measure the performance of your system. There is more information in the StackOverflow question linked above:
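To illustrate the overall split, here is a minimal sketch that assumes you have already extracted (message, reply) pairs from the Messenger export; the 90/5/5 ratios, the function name, and the output filenames matching the OpenNMT convention (`src-train.txt`, `tgt-train.txt`, etc.) are illustrative choices, not a prescribed recipe:

```python
import random

def split_and_write(pairs, train_frac=0.9, val_frac=0.05, seed=42):
    """Shuffle (source, target) pairs and write train/val/test files."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    n_val = int(len(pairs) * val_frac)
    splits = {
        "train": pairs[:n_train],
        "val": pairs[n_train:n_train + n_val],
        "test": pairs[n_train + n_val:],  # remainder held out for evaluation
    }
    for name, subset in splits.items():
        # One sentence per line, source and target files kept in sync.
        with open(f"src-{name}.txt", "w", encoding="utf-8") as fs, \
             open(f"tgt-{name}.txt", "w", encoding="utf-8") as ft:
            for src, tgt in subset:
                fs.write(src.strip() + "\n")
                ft.write(tgt.strip() + "\n")
    return {name: len(subset) for name, subset in splits.items()}
```

You would then point training at the `*-train.txt` and `*-val.txt` files, and only use `src-test.txt`/`tgt-test.txt` after training to score the model.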