Advice for creating training and testing sets


I’m working with all of my conversation data from FB Messenger. What process do you use when creating training and test sets? What preprocessing steps do you need to take before generating the required .txt files for OpenNMT?

I’m also new to OpenNMT and this forum. Does anyone have any other gotchas or tips regarding preparing a conversational dataset for use with OpenNMT?


This post seems relevant:

You can search for SentencePiece or BPE, which are subword tokenization methods commonly used when training NN models.
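To make the idea concrete, here is a minimal toy sketch of the BPE merge-learning step: it repeatedly finds the most frequent adjacent symbol pair in the corpus and merges it into a new token. This is only an illustration of the algorithm; for real data you would use the actual SentencePiece or subword-nmt tools rather than this code.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of whitespace-tokenized words."""
    # Represent each word as a tuple of characters, with its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

For example, on a corpus of five `low` and two `lower`, the first learned merge is `("l", "o")`, since that pair appears in every word.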

In the OpenNMT docs, it says we only need to provide training and validation files (src-train.txt, tgt-train.txt, src-val.txt, tgt-val.txt). When/where do we provide the test set?

The test set is not required for training, but it is used later on to measure the performance of your system. There is more information in the StackOverflow question linked above:
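To illustrate the overall split, here is a minimal sketch that assumes you have already extracted (message, reply) pairs from the Messenger export; the 90/5/5 ratios, the function name, and the output filenames matching the OpenNMT convention (`src-train.txt`, `tgt-train.txt`, etc.) are illustrative choices, not a prescribed recipe:

```python
import random

def split_and_write(pairs, train_frac=0.9, val_frac=0.05, seed=42):
    """Shuffle (source, target) pairs and write train/val/test files."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    pairs = list(pairs)
    rng.shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    n_val = int(len(pairs) * val_frac)
    splits = {
        "train": pairs[:n_train],
        "val": pairs[n_train:n_train + n_val],
        "test": pairs[n_train + n_val:],  # remainder held out for evaluation
    }
    for name, subset in splits.items():
        # One sentence per line, source and target files kept in sync.
        with open(f"src-{name}.txt", "w", encoding="utf-8") as fs, \
             open(f"tgt-{name}.txt", "w", encoding="utf-8") as ft:
            for src, tgt in subset:
                fs.write(src.strip() + "\n")
                ft.write(tgt.strip() + "\n")
    return {name: len(subset) for name, subset in splits.items()}
```

You would then point training at the `*-train.txt` and `*-val.txt` files, and only use `src-test.txt`/`tgt-test.txt` after training to score the model.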