I'm not sure the number of unknown words is the best criterion. A sentence with several rare words could still be very interesting, while a sentence made only of common words might be of no real interest.
Perhaps I would test something like:
- randomly select a given number of sentences: 2M, 5M, …?
- quickly train a model on them.
- do a fast translation of your whole corpus, using beam_size=1. This could be very time-consuming if your data set is really huge, so you may also have to restrict it to a randomly chosen subset.
- keep the best translated sentences: 2M, 5M, …?
- train a model carefully on them.
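
The selection part of the steps above could be sketched roughly like this in Python. It is only an illustration under assumptions: `translate` is a hypothetical callable wrapping the quickly trained model (run with beam_size=1), and "best translated" is taken here to mean "closest to the reference translation", measured with a simple token similarity from `difflib` as a stand-in for whatever metric you prefer. Training itself is left out.

```python
import heapq
import random
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, reference translation)


def random_subset(pairs: Sequence[Pair], size: int, seed: int = 0) -> List[Pair]:
    """Draw a random subset without replacement (first step, and optionally
    again before translating if the corpus is too large)."""
    rng = random.Random(seed)
    return rng.sample(list(pairs), min(size, len(pairs)))


def similarity(hypothesis: str, reference: str) -> float:
    """Assumed quality score: token-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, hypothesis.split(), reference.split()).ratio()


def keep_best(
    pairs: Sequence[Pair],
    translate: Callable[[str], str],  # hypothetical: the fast model, beam_size=1
    keep: int,
) -> List[Pair]:
    """Translate every source sentence and keep the `keep` pairs whose
    translation is closest to the reference."""
    scored = (
        (similarity(translate(src), ref), i)
        for i, (src, ref) in enumerate(pairs)
    )
    best = heapq.nlargest(keep, scored)
    return [pairs[i] for _, i in best]
```

With a real corpus you would call `random_subset` first, train the quick model on the result outside this script, then pass a wrapper around that model as `translate` to `keep_best` and train the final model on what it returns.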
Just an idea… along the lines of this one: