From my experiments, I think that bad (or very bad) translations from a model can be caused by bad (or very bad) entries in the training corpus. Possibly, the translation errors have no real direct relation with the bad training entries themselves.
I'm currently experimenting with the following method:
1. Train a model with the whole corpus.
2. Translate the whole training corpus with the obtained model.
3. Do a sentence-level evaluation of the translations, and sort the entries according to it. The sentences scored below a given threshold are almost all bad entries.
4. Remove all entries below the threshold from the training set (see the filtering sketch after this list), and retrain the model obtained at step 1.
5. Go back to step 2. Good entries that were removed in the first iteration will possibly be kept in the second one, thanks to a much better model.
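To make the filtering step (4) concrete, here is a minimal Java sketch, not my actual code (which is tied to my dev project): the file names and the score file format (one score per line, aligned with the corpus) are just assumptions, and the training and translation steps themselves are done with the NMT toolkit as usual.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Step 4: keep only the sentence pairs whose sentence-level score is above
// a threshold. "corpus.src", "corpus.tgt" and "corpus.scores" (one score
// per line, aligned with the corpus) are hypothetical names.
public class FilterCorpus {
    public static void main(String[] args) throws IOException {
        double threshold = 50.0;  // hypothetical cut-off, e.g. a QED value

        List<String> src = Files.readAllLines(Paths.get("corpus.src"), StandardCharsets.UTF_8);
        List<String> tgt = Files.readAllLines(Paths.get("corpus.tgt"), StandardCharsets.UTF_8);
        List<String> scores = Files.readAllLines(Paths.get("corpus.scores"), StandardCharsets.UTF_8);

        StringBuilder keptSrc = new StringBuilder();
        StringBuilder keptTgt = new StringBuilder();
        for (int i = 0; i < src.size(); i++) {
            if (Double.parseDouble(scores.get(i)) >= threshold) {
                keptSrc.append(src.get(i)).append('\n');
                keptTgt.append(tgt.get(i)).append('\n');
            }
        }
        Files.write(Paths.get("filtered.src"), keptSrc.toString().getBytes(StandardCharsets.UTF_8));
        Files.write(Paths.get("filtered.tgt"), keptTgt.toString().getBytes(StandardCharsets.UTF_8));
    }
}
```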
PS: in my case, I'm doing the sentence-level evaluation with my own calculation, based on an edit distance over char tokens. But a commonly used measure would perhaps also do the job.
PS: the general idea behind this is… since NMT provides very good automatic translations, very close to human quality, a (good) NMT model can be used as an evaluator to optimize the reviewing/revision of a translation memory or a human-translated document. Human translations that are very far from their NMT translations possibly need more attention than the others.
This is close to Neural Machine Translation from Simplified Translation - where the translated corpus is not used to filter the corpus, but to… retrain the model. In that case, these sentence pairs naturally disappear and are substituted by more plausible translations. It would be interesting to see if we can combine the two approaches, and/or how much quality we can gain by removing wrong sentence pairs.
In my case, the threshold in step 4 is set low enough that what is removed is really errors in the training corpus: entries that, for various reasons, should NOT be in it.
But if the threshold is set a bit higher, perhaps it would also be possible to remove entries that are correct, but are:

- less compatible with the learning capability of the network,
- less homogeneous with the rest of the corpus, and thus singularities the network has to manage,
- variants of other versions of the same source sentences, and thus contradictions the network has to manage.
Regarding your paper, I wonder if the main gain of your method is not in fact simply to indirectly remove such bad entries from the corpus, much more than to simplify the translations, as you claim… hard to know.
In the same vein, here is something to test in order to build in-domain models:
1. Build a small model on the (small) in-domain training set. It's supposed to be quite fast.
2. Translate a big generic training set (Europarl, MultiUN, …) with the small in-domain model.
3. Make a sentence-level evaluation of the translations, and sort the sentences by estimated quality.
4. Build a new training set by mixing the in-domain training set with the part of the generic corpus above a quality threshold (see the selection sketch after this list). This part of the generic set is supposed to be the most homogeneous with the in-domain set.
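Same kind of minimal sketch for step 4 here, under the same assumptions as before (hypothetical file names, one score per line aligned with the generic corpus): it keeps the generic pairs scored above the threshold by the small in-domain model and appends them to a copy of the in-domain set.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;

// Step 4: mix the in-domain set with the generic pairs scored above a
// quality threshold by the small in-domain model. File names are hypothetical.
public class BuildInDomainSet {
    public static void main(String[] args) throws IOException {
        double threshold = 50.0;  // hypothetical quality cut-off

        // Start the mixed training set from a copy of the in-domain data.
        Files.copy(Paths.get("indomain.src"), Paths.get("mixed.src"), StandardCopyOption.REPLACE_EXISTING);
        Files.copy(Paths.get("indomain.tgt"), Paths.get("mixed.tgt"), StandardCopyOption.REPLACE_EXISTING);

        List<String> genSrc = Files.readAllLines(Paths.get("generic.src"), StandardCharsets.UTF_8);
        List<String> genTgt = Files.readAllLines(Paths.get("generic.tgt"), StandardCharsets.UTF_8);
        List<String> scores = Files.readAllLines(Paths.get("generic.scores"), StandardCharsets.UTF_8);

        StringBuilder selSrc = new StringBuilder();
        StringBuilder selTgt = new StringBuilder();
        for (int i = 0; i < genSrc.size(); i++) {
            if (Double.parseDouble(scores.get(i)) >= threshold) {
                selSrc.append(genSrc.get(i)).append('\n');
                selTgt.append(genTgt.get(i)).append('\n');
            }
        }
        Files.write(Paths.get("mixed.src"), selSrc.toString().getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
        Files.write(Paths.get("mixed.tgt"), selTgt.toString().getBytes(StandardCharsets.UTF_8), StandardOpenOption.APPEND);
    }
}
```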
In step 1, I think it's possible to force fast convergence, with a learning rate that decays strongly over very few epochs. In fact, a draft model may be sufficient for the next evaluation step.
The problem with statistical evaluations is that they can be quite disconnected from the final translation goal. As you said, this is also often the problem with the BLEU evaluation. Just as an example, it took me a few weeks/months of NMT experimentation to understand that the PPL evaluations are of strictly no interest to me, especially the validation one.
In my case above, I'm trying to focus on the ability of sentences to be well translated by a model, using a char-level edit distance close to the post-editing cost (translators often work char by char). Whatever the content of the black box (here, the NMT model), the whole process is designed according to the final translation goal and its constraints.
PS: regarding the question of providing scripts… I'm working in Java, and my code is very dependent on the whole dev project (classes, packages, libraries, …). It may be a bit hard to share.
Here is my Quality based on Edit Distance (QED) calculation:

- i = number of char insertions
- d = number of char deletions
- s = number of char substitutions
- len1 = number of chars in the reference sentence
- len2 = number of chars in the translated sentence

QED = 100 * ( 1 - ( i + d + 2*s ) / ( 2 * Min(len1, len2) ) )
For an exact translation, QED = 100.
If all chars need to be rewritten, QED = 0.
If there is a big difference in size between the two sentences, QED < 0.
In practice, I think a translation starts to bring something possibly useful for QED > 50.
For a whole corpus evaluation, I simply sum i, d, s, len1, and len2 over all sentences.
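For illustration, here is a simplified Java sketch of the QED computation (my real code is tied to the dev project and may differ in details): a standard char-level Levenshtein alignment, a backtrace to split the distance into insertions, deletions and substitutions, then the formula above. The corpus-level score just sums these counts over all sentences, as said above.

```java
// Simplified sketch of the QED computation: a standard char-level Levenshtein
// DP (unit costs), a backtrace to split the distance into insertions,
// deletions and substitutions, then the QED formula. Ties between equally
// optimal alignments are broken arbitrarily.
public class Qed {

    // Operation counts of one optimal char-level alignment.
    static final class EditCounts {
        int insertions, deletions, substitutions;
    }

    static EditCounts align(String reference, String translation) {
        int n = reference.length(), m = translation.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) dp[i][0] = i;   // delete the remaining reference chars
        for (int j = 0; j <= m; j++) dp[0][j] = j;   // insert the remaining translation chars
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = dp[i - 1][j - 1] + (reference.charAt(i - 1) == translation.charAt(j - 1) ? 0 : 1);
                dp[i][j] = Math.min(sub, Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1));
            }
        }
        // Backtrack one optimal path and count each operation type.
        EditCounts c = new EditCounts();
        int i = n, j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && dp[i][j] == dp[i - 1][j - 1]
                    && reference.charAt(i - 1) == translation.charAt(j - 1)) {
                i--; j--;                            // match, no cost
            } else if (i > 0 && j > 0 && dp[i][j] == dp[i - 1][j - 1] + 1) {
                c.substitutions++; i--; j--;
            } else if (i > 0 && dp[i][j] == dp[i - 1][j] + 1) {
                c.deletions++; i--;
            } else {
                c.insertions++; j--;
            }
        }
        return c;
    }

    // QED = 100 * ( 1 - ( i + d + 2*s ) / ( 2 * Min(len1, len2) ) )
    static double qed(String reference, String translation) {
        int len1 = reference.length(), len2 = translation.length();
        if (len1 == 0 || len2 == 0) return 0.0;      // guard for empty sentences (my own assumption)
        EditCounts c = align(reference, translation);
        double cost = c.insertions + c.deletions + 2.0 * c.substitutions;
        return 100.0 * (1.0 - cost / (2.0 * Math.min(len1, len2)));
    }

    public static void main(String[] args) {
        System.out.println(qed("the cat sat on the mat", "the cat sat on the mat")); // 100.0
        System.out.println(qed("the cat sat on the mat", "a dog sat on a mat"));     // a lower score
    }
}
```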
But my goal is not really to evaluate the quality of a translation, rather to evaluate the post-editing cost it requires.
As far as I know, translators almost never use copy/cut/paste at the word level (when post-editing within a given sentence). They work with char-level operations, and the most used keys are certainly the deletion keys (Backspace/DEL). That's why a substitution costs twice in my formula. It's also why the normalisation is done with Min(len1, len2), which is more pertinent than the reference or the translation length alone, as those would wrongly minimize the cost in many cases.
PS: even if a proposed translation is quite good, if the words or the formulation are not the ones a translator would have written, he will edit it to put it his own way…
Do you have a note of what BLEU score you got when you did this? I just translated a part of the training corpus with the relevant model and got 59.55, although I was expecting higher.
59.55 already seems like a high score to me! I wonder if your validation set isn't too close to your training set…
My gain was around +1 / +2 BLEU. But in fact, my goal wasn't to gain on BLEU itself. The BLEU evaluation highly depends on many factors, especially the way the translation references were built. You can get translations that are very different from the reference translations, with both being good. It's not that pertinent… especially when you reach such high values.
My first goal in this process (see the first post) was to improve on the (few) very bad translations. These very bad translations are almost invisible in a BLEU score calculated over a whole set.
PS: I think the effect on BLEU could be much stronger when this technique is applied to in-domain training, as explained in post #6. The process explained in post #1 is meant to remove aberrant entries from the training set, not to really improve its global quality.