As a corpus of 1 million sentences does not seem to be a general-purpose one, I assume the main question here is: “what is the purpose of this corpus?” The answer to that question can help answer your other questions.
Is it for a low-resource language? Then quality is the most important factor; we do not need another crawled (unrevised) corpus. Is it a technical dataset? Then note that technical sentences are usually shorter, and so on.
As for references, I would say papers about filtering approaches can give insights into the quality issues that should be avoided in the first place when creating a new dataset. Examples include:
There are a few options here. If these students are really going to translate 1 million sentences, it is better for those to be real translations. I mean, instead of just collecting data randomly and translating it, check with a couple of non-profit organizations and cooperate with them to translate content that can help others. Translators without Borders and AMARA are two examples among many.
Another option would be what @argosopentech suggested, and it can be elaborated into two options:
find a crawled dataset in your language pair, filter it automatically, and then assign human editors to check the remaining sentences. They can either approve the translation of a segment or correct it; or
use the same tools to crawl and match a new parallel dataset, and then maybe also do #1. Examples of these tools: Facebook created LASER, and Google (1, 2) created LaBSE and m-USE.
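To make the matching idea in option #2 concrete, here is a minimal sketch of scoring candidate sentence pairs by the cosine similarity of their embeddings. The `embed_src`/`embed_tgt` callables and the `threshold` value are illustrative assumptions; in practice the embeddings would come from a multilingual model such as LASER or LaBSE, and the toy vectors below exist only so the snippet runs on its own.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_pairs(pairs, embed_src, embed_tgt, threshold=0.8):
    """Keep candidate (src, tgt) pairs whose embeddings are similar enough.

    embed_src / embed_tgt are hypothetical callables mapping a sentence
    to a vector; with LASER or LaBSE these would be model encoders.
    """
    kept = []
    for src, tgt in pairs:
        if cosine(embed_src(src), embed_tgt(tgt)) >= threshold:
            kept.append((src, tgt))
    return kept

# Toy embeddings for demonstration only; real ones come from a model.
toy = {"hello": [1.0, 0.0], "hola": [0.9, 0.1], "cheese": [0.0, 1.0]}
pairs = [("hello", "hola"), ("hello", "cheese")]
print(filter_pairs(pairs, toy.get, toy.get))  # → [('hello', 'hola')]
```

The same scoring works for option #1 as a pre-filter: pairs below the threshold go to human editors first, or get dropped outright.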
I would like still to answer this specific question:
This can differ based on field and language. For example, sentences in legal texts are longer than those in technical texts, and the same idea in Spanish could be expressed in more words than in English.
So, basically, what I do is take a dataset from different fields, delete any sentence longer than x words (e.g. 150), and then compute the average sentence length (in words) of the remaining sentences.
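That procedure can be sketched in a few lines; the 150-word cap comes from the example above, and the word counting here is a simple whitespace split, which is an assumption rather than a fixed rule.

```python
def length_stats(sentences, max_words=150):
    # Drop sentences longer than max_words, then report the average
    # length (in words) of what remains.
    kept = [s for s in sentences if len(s.split()) <= max_words]
    avg = sum(len(s.split()) for s in kept) / len(kept) if kept else 0.0
    return kept, avg

corpus = ["short sentence here", "word " * 200]  # second one has 200 words
kept, avg = length_stats(corpus)
print(len(kept), avg)  # → 1 3.0
```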
Note that the maximum token length and “filter too long” options in OpenNMT are counted in subwords, not words. So check the default or the value you assign, and make sure your numbers match.
I would chime in that OPUS datasets are riddled with mistakes (wrong languages, mistranslations, etc.). It is virtually impossible to go through them manually, so all sorts of cleaning/filtering tools are out there, like OpusTools and Bicleaner, which help with filtering by (among other heuristics) the ratio of source to target length.
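As an illustration of the length-ratio heuristic (a hypothetical reimplementation of the idea, not the actual code of Bicleaner or OpusTools), with `max_ratio=2.0` as an assumed cutoff:

```python
def ratio_ok(src, tgt, max_ratio=2.0):
    # Reject pairs whose source/target word counts diverge too much;
    # such pairs are often misalignments rather than translations.
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio

pairs = [
    ("The cat sat on the mat", "Le chat est assis sur le tapis"),  # 6 vs 7
    ("Hello", "This is clearly not a translation of the source"),  # 1 vs 9
]
print([ratio_ok(s, t) for s, t in pairs])  # → [True, False]
```

A ratio check alone will not catch wrong-language or fluent-but-wrong pairs, which is why tools like Bicleaner combine several signals.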