Save validation translations at each epoch

(Etienne Monneret) #1

If ONMT is building the validation translation for its evaluation at each epoch, why not save it to a file at the same time as the model? Perhaps with the same name as the model and a “.valid.trans” suffix?

(Guillaume Klein) #2

It does not actually translate the validation data but just computes the perplexity on this dataset.

See also:

Early stopping: a fake solution?
(Etienne Monneret) #3

It would be useful to be able to give an optional list of files to be translated at each epoch.

In my tests, I want to evaluate BLEU/TER/WER/… values at each epoch. I want to do it on a validation file and on a small part of the training file, to get curves similar to the training PPL and validation PPL curves, but with evaluation metrics. To do this in a comfortable way, I have to stop the training, launch all the translations, then restart the training. It’s time-consuming, and there is always a risk of damaging some options somewhere and not restarting in the right conditions.

(Guillaume Klein) #4

We are thinking about a meta-script that runs the training epoch by epoch, runs arbitrary commands after each epoch (BLEU evaluation, for example), and sets new options as needed. It could just be a script that abstracts the retraining command line and lets the user plug in arbitrary rules.
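A minimal sketch of what such a driver could look like, assuming hypothetical data/output file names and the standard OpenNMT retraining flags (`-train_from`, `-continue`, `-end_epoch`); the exact checkpoint naming below is an assumption, since real checkpoints also embed the validation perplexity:

```python
# Hypothetical sketch of the proposed meta-script: run training one epoch at a
# time, then run arbitrary commands (e.g. translation for BLEU evaluation)
# between epochs. File names and checkpoint names are assumptions.
import subprocess

def epoch_commands(epoch, model_dir="models"):
    """Build the train and evaluate commands for one epoch cycle."""
    train = ["th", "train.lua", "-data", "demo-train.t7",
             "-save_model", model_dir + "/model", "-end_epoch", str(epoch)]
    if epoch > 1:
        # Resume from the previous epoch's checkpoint.
        train += ["-train_from",
                  "%s/model_epoch%d.t7" % (model_dir, epoch - 1), "-continue"]
    evaluate = ["th", "translate.lua",
                "-model", "%s/model_epoch%d.t7" % (model_dir, epoch),
                "-src", "valid-src.txt",
                "-output", "pred_epoch%d.txt" % epoch]
    return train, evaluate

def run_epochs(n_epochs, runner=subprocess.run):
    """Drive the train/evaluate cycle; `runner` is injectable for testing."""
    for epoch in range(1, n_epochs + 1):
        for cmd in epoch_commands(epoch):
            runner(cmd, check=True)
```

The point of the design is that the evaluation step between epochs is an arbitrary shell command, so BLEU, TER, or anything else can be plugged in there.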

There are no definite plans for it yet, but I think @vince62s is exploring the possibilities of such a training process.

(Terence Lewis) #5

I’m starting to realise too that this would be something handy :slight_smile:

(Vincent Nguyen) #6

I will try to check in a new recipe to support this. Patience.

I am just validating that it works fine in various contexts.

@tel34: did you try a deeper network?

(Terence Lewis) #7

@tel34: did you try a deeper network?
@vince62s Training is ongoing - it’s a big corpus :slight_smile: Will report my findings as soon as it finishes!

(Matt Relich) #8

Hi all! I also wanted this feature, so I hacked a translate option into Trainer.lua. You can find my code here:

If you think this might be useful I can clean it up and merge.

(Guillaume Klein) #9

Hi, very nice!

We could definitely check that in. And it is a good opportunity to add BLEU as a validation metric.

We’ll see that next week. :wink:

(Matt Relich) #10

Great! Let me know if you need any help or want me to make any modifications.

(jean.senellart) #11

Hi Matt,

I introduced BLEU code in Lua here:

Would you like to add this to your branch to also calculate the score at each epoch?

It is now merged in master. The function to call for calculating BLEU in the code is simply:

BLEU = onmt.scorers.bleu(cand, refs) 


  • cand is a table of sequences representing the translation output (each sequence is a table of tokens).
  • refs is a table of references; each reference has the same format as cand.

(Vincent Nguyen) #12


Is this “smoothed” BLEU+1 or plain BLEU? The difference mainly matters for short sequences.

Just to make sure the scores are comparable.

(jean.senellart) #13

It is regular BLEU, as implemented in multi-bleu.perl.

(Vincent Nguyen) #14

OK, just to be clear: with no smoothing, it returns BLEU=0 for sequences of length 3 or less when the default order is 4.
In Moses, when per-sentence scoring is needed, the score is smoothed with the +1, since there may be many very short sentences. I am just pointing out that since we sort by size to create batches, several batches may be scored 0.

Not an issue, just for reference.
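As a toy illustration in Python (not the Lua scorer itself) of why unsmoothed BLEU-4 is exactly zero for candidates shorter than 4 tokens, and how +1 smoothing avoids it:

```python
# Toy single-reference BLEU: with plain BLEU-4, a candidate shorter than 4
# tokens has no 4-grams at all, so one precision is zero and the geometric
# mean -- hence the whole score -- collapses to 0. "+1" smoothing adds one
# to each n-gram numerator and denominator so every order stays nonzero.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(cand, ref, max_order=4, smooth=False):
    log_precisions = []
    for n in range(1, max_order + 1):
        matches = sum((ngrams(cand, n) & ngrams(ref, n)).values())
        total = max(len(cand) - n + 1, 0)
        if smooth:  # BLEU+1 smoothing
            matches, total = matches + 1, total + 1
        if matches == 0 or total == 0:
            return 0.0  # plain BLEU: one empty order zeroes the whole score
        log_precisions.append(math.log(matches / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_order)

cand = ["the", "cat", "sat"]          # only 3 tokens: no 4-grams exist
ref = ["the", "cat", "sat"]
print(bleu(cand, ref))                # 0.0 with plain BLEU-4
print(bleu(cand, ref, smooth=True))   # 1.0 with +1 smoothing (perfect match)
```

This matches the point above: batches of very short sentences all score 0 under plain BLEU-4, while BLEU+1 keeps them comparable.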

(Guillaume Klein) #15

Thanks @jean.senellart for the BLEU implementation.

I’m working on integrating this and extending @mrelich’s work. :wink:

(Matt Relich) #16

@jean.senellart sorry, I didn’t see your message! For some reason I wasn’t notified of it. Yes, I am happy to integrate this into my branch, but it seems @guillaumekln is already working on it. I’m happy to help if needed; just ping me next time, since that seems to get an email to my inbox.

(Guillaume Klein) #17

I first propose supporting BLEU as a validation metric, which should be a very useful feature:

It’s not exactly this feature request, as we don’t save the translation output, but the idea is to introduce BLEU scoring (or a similar score) during the training process. Everything that used the validation perplexity now more generally uses the validation score, including the learning rate decay.

Monitoring the BLEU evolution should be more natural than perplexity. However, because of how the preprocessing is set up, BLEU will be computed against gold sentences with resolved vocabulary, i.e. with OOV words replaced by <unk> tokens. I think this is a detail that does not impact the score interpretation, but let me know if you have any concerns about it.
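The “resolved vocabulary” caveat above can be illustrated with a toy Python sketch (the vocabulary and sentences are made up): any reference token outside the training vocabulary becomes <unk> before BLEU is computed.

```python
# Toy illustration of vocabulary resolution: tokens not in the training
# vocabulary are replaced by <unk>, so BLEU is scored against this resolved
# reference rather than the raw gold sentence.
def resolve_vocab(tokens, vocab):
    return [t if t in vocab else "<unk>" for t in tokens]

vocab = {"the", "cat", "sat", "on", "mat"}
print(resolve_vocab(["the", "dog", "sat"], vocab))  # ['the', '<unk>', 'sat']
```

A model that correctly emits <unk> for an OOV word thus gets n-gram credit for it, which slightly inflates the score compared to scoring against the raw reference.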

Then, I think we will add what @Etienne38 proposed based on @mrelich work:

The difference is that it is not directly related to validation (in the sense of how it is used in OpenNMT) but is a way to automate the translation of files across epochs. Something like this:

th train.lua [...] -translate_files file1.txt file2.txt file3.txt -translate_every 2

which translates file1.txt, file2.txt, and file3.txt every 2 epochs. Does this make sense / would it be useful?

(Etienne Monneret) #18


(Vincent Nguyen) #19

IMO, translating files could be done separately in another process.

I find it odd to mix “valid” and “test” stuff in the training process.

Maybe just translating the “valid_src” (since it is translated anyway to score with BLEU) would be enough.

(Etienne Monneret) #20

Each epoch, without stopping/restarting the training process, I would like to be able to translate files such as:

  • a validation set = sentences supposed not to overlap the training data
  • a checking set = sentences contained in the training data
  • both of them, possibly in 2 versions = with generic content or in-domain content
  • a file used in another, completely different training, for example from another domain, for comparison