In my experiments, I've come to think that bad (or very bad) translations from a model can be caused by bad (or very bad) entries in the training corpus. The translation errors may have no obvious direct relation to the bad training entries.
I'm currently experimenting with a method:
1) train a model with the whole corpus
2) translate the whole training corpus with the obtained model
3) do a sentence-level evaluation of the translations, and sort the entries by score. The lowest-scoring sentences, below a given threshold, are almost all bad entries
4) remove all entries below the threshold from the training set, and retrain the model obtained at step 1
5) go back to step 2. Good entries that were removed in the first iteration may be kept in the second one, thanks to a much better model
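
The loop in steps 1 to 5 can be sketched roughly as follows. This is only a minimal sketch, not tied to any toolkit: `iterative_filter`, `train`, `translate` and `score` are hypothetical names, and the three callables are placeholders for whatever training, inference and sentence-level scoring you actually use. It assumes a higher score means a better match, and that the whole original corpus is re-scored at each iteration so that previously removed entries can come back.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, reference translation)


def iterative_filter(
    corpus: Sequence[Pair],
    train: Callable[[List[Pair], Optional[Any]], Any],   # hypothetical: (pairs, previous model) -> model
    translate: Callable[[Any, List[str]], List[str]],    # hypothetical: (model, sources) -> hypotheses
    score: Callable[[str, str], float],                  # hypothetical: (hypothesis, reference) -> score
    threshold: float,
    iterations: int = 2,
) -> List[Pair]:
    """Return the entries kept after the given number of filtering iterations."""
    kept = list(corpus)
    model = None
    sources = [src for src, _ in corpus]
    for _ in range(iterations):
        # Step 1 on the first pass, step 4 (retraining) on later passes.
        model = train(kept, model)
        # Step 2: translate the WHOLE original corpus, not only the kept part,
        # so entries removed earlier get a second chance (step 5).
        hypotheses = translate(model, sources)
        # Step 3: sentence-level score of each translation against its reference.
        scores = [score(hyp, ref) for hyp, (_, ref) in zip(hypotheses, corpus)]
        # Step 4: drop everything below the threshold.
        kept = [pair for pair, s in zip(corpus, scores) if s >= threshold]
    return kept
```

The function only returns the filtered entries; a final retraining on that set is left to the caller.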
PS: in my case, I'm doing the sentence-level evaluation with my own calculation, based on an edit distance (over character tokens). But perhaps a commonly used metric would also do the job.
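
To give an idea of what a character-level edit-distance score can look like, here is a minimal sketch. It is not my exact calculation: it uses the plain Levenshtein distance, and the normalization by the longer string's length (giving a similarity between 0 and 1) is an assumption made for the example. The names `edit_distance` and `similarity` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def similarity(hypothesis: str, reference: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(hypothesis), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(hypothesis, reference) / longest
```

With this kind of score, an entry would be dropped when `similarity(model_translation, reference)` falls below the chosen threshold.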
PS: the general idea behind this is that, since NMT systems provide very good automatic translations, very close to human quality, a (good) NMT model can be used as an evaluator to optimize the review/revision of a translation memory or a human-translated document. Human translations that are very far from their NMT translations possibly need more attention than the others.