@jean.senellart Sorry, I got busy the last few days and didn’t respond. I like @dbl’s suggestion to use Damerau-Levenshtein distance divided by sentence length. I see there is already an implementation of this now, but if needed I can test / work on it this coming weekend.
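For concreteness, here is a minimal sketch of that metric over tokenized sentences (the names are illustrative, not taken from the existing implementation). It uses the optimal string alignment variant of Damerau-Levenshtein and normalizes by reference length:

```lua
-- Sketch: optimal string alignment variant of Damerau-Levenshtein
-- over sentences given as Lua tables of tokens.
local function damerauLevenshtein(ref, hyp)
  local d = {}
  for i = 0, #ref do
    d[i] = {}
    d[i][0] = i
  end
  for j = 0, #hyp do
    d[0][j] = j
  end
  for i = 1, #ref do
    for j = 1, #hyp do
      local cost = (ref[i] == hyp[j]) and 0 or 1
      d[i][j] = math.min(d[i-1][j] + 1,      -- deletion
                         d[i][j-1] + 1,      -- insertion
                         d[i-1][j-1] + cost) -- substitution
      -- transposition of two adjacent tokens
      if i > 1 and j > 1 and ref[i] == hyp[j-1] and ref[i-1] == hyp[j] then
        d[i][j] = math.min(d[i][j], d[i-2][j-2] + 1)
      end
    end
  end
  return d[#ref][#hyp]
end

-- Normalize by reference length so sentences of different
-- lengths get comparable scores (0 means identical).
local function normalizedDistance(ref, hyp)
  if #ref == 0 then return (#hyp == 0) and 0 or 1 end
  return damerauLevenshtein(ref, hyp) / #ref
end
```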
I think it would be nice to have TER (translation error rate) as well, but since TER adds a search over phrase shifts on top of the usual edit operations, it would likely take me quite a while to code up in Lua, and I’m not sure about the performance.
@jean.senellart I came back to this to see whether translating the validation set every epoch had been included in the updates, but it doesn’t seem so. Does that mean we will go with @vince62s’s script for unloading / reloading during training? That’s a pity, since it seems trivial to just run through the validation data during training.
If there is no plan to add this feature, I will maintain my own script for it in case anyone else is interested. I should get back to this task in the next few days and will update my fork.
Would you prefer an output with just the validation set translation, or, more usefully, a file with both reference and translation plus a score for each sentence (BLEU or TER) at the beginning of each line? (Similar to the output of analysis.perl in the Moses project.)
@guillaumekln I’m with @vince62s on this one: I think having a flag to control whether or not to save would be best, so users can choose to turn it on. Having it also output the scores would be nice for additional analysis. Would you like me to take a stab at implementing this?
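Something like this is what I have in mind, declared with torch.CmdLine like the other training options (just a sketch; the option name `-save_valid_translation` is hypothetical):

```lua
require 'torch'

-- Sketch only: '-save_valid_translation' is a hypothetical option name.
local cmd = torch.CmdLine()
cmd:option('-save_valid_translation', false,
           [[If set, dump the validation set translation after each epoch.]])
local opt = cmd:parse(arg)

if opt.save_valid_translation then
  -- hook the dump into the end-of-epoch validation pass here
end
```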
@guillaumekln Sweet, that looks great! So at the moment it would just dump the output, which is actually fine for me. If we also want the scores, then either CSV or some other delimiter (e.g. | or <>).
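As a sketch of the scored dump I have in mind (the function name and format are placeholders, not an agreed spec):

```lua
-- Sketch: one sentence per line, "score | reference | translation".
local function dumpValidation(path, scores, refs, hyps)
  local f = assert(io.open(path, 'w'))
  for i = 1, #refs do
    f:write(string.format('%.4f | %s | %s\n', scores[i], refs[i], hyps[i]))
  end
  f:close()
end
```

Plain CSV would work too, but translations can contain commas, so a rarer delimiter is safer for later analysis scripts.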
Note that the comment I made earlier in this thread still applies:
> However, as we set up the preprocessing, BLEU will be computed against gold sentences with resolved vocabulary, i.e. with OOVs replaced by `<unk>` tokens.