Save validation translations at each epoch

(Matt Relich) #21

@guillaumekln having that option would be perfect!

@vince62s I agree it’s not standard, but for some applications it’s nice to see the translated output to get a feel for where models are doing well and where they are struggling. This saves you the extra step of either stopping training to evaluate, or having to run another job in parallel (and for those who have to rely on AWS for GPUs, this means firing up a machine, copying data, and so on).

(Vincent Nguyen) #22

My only fear is that if we add too many things to the basic training process it will become a monster with hundreds of options.
On the other hand, it would be quite easy, and nice, to implement a meta-script that accomplishes whatever each individual user wants. For instance, in most cases it does not make sense to translate test files after only 1 or 2 epochs; getting the BLEU score on the validation set is more than enough.

Documenting a project like this is a challenging task, and it becomes even more difficult when one has to read the code…

Anyway, I am just scared that we will end up with a “do_all_for_me_please.lua” script including language-dependent tokenization, preprocessing, training, translating, …
corpus cleaning? :slight_smile:

Regarding running another process for translating: why don’t you just open another session and run it on the same machine?
AWS with 2 GPUs would be perfect…
Last but not least: have a look at the recipes…

(Etienne Monneret) #23

Because performance on a single GPU drops drastically when training and translating at the same time! And that is only the performance aspect of the problem. If a training run fills 9.5 GB of an 11 GB GPU, you simply can’t load the model a second time to translate concurrently…

This request is motivated by real problems encountered in real situations, not a joke made for the pleasure of having a “do_all_for_me_please.lua” script…

(Matt Relich) #24

@vince62s Yes, I completely understand where you are coming from. It is a delicate balance between creating something useful and not adding too much complexity. If translation after N epochs is where you want to draw the line, then that’s fine; I will drop it. It’s not difficult for me to keep adding this hack to newer versions of OpenNMT, since it is very useful for me (and a few others, I suppose).

I am using OpenNMT as part of a pipeline for building a spelling correction service, so perplexity and BLEU score are largely useless in this case. I want to know the true accuracy, i.e. how often the correct sequence of letters is predicted exactly. This cannot be computed without the translated output.

You are also right that I could simply pick a larger GPU instance, but that is effectively telling me to spend more money when a simple software solution exists that costs nothing. I don’t need that extra GPU until after each epoch finishes training, so why pay for it and let it sit idle most of the time?

(Vincent Nguyen) #25

Guys, I am not saying this is useless.

I did it myself and shared it here (without 2 GPUs).

I am just questioning what should and should not be part of OpenNMT (i.e., embedded vs. a meta-script).

(Etienne Monneret) #26

That is why I argued that a script won’t be a solution on a single GPU, if the model currently being trained is not unloaded from the card.

(Etienne Monneret) #27

I read your script carefully, but I don’t understand how it does a translation at each epoch.

(Vincent Nguyen) #28

Ah, sorry, I never pushed the latest one; I will do so shortly.

(Etienne Monneret) #29

As I said, you are unloading the model at each epoch. If you want to translate 2 files, for example, you then need to reload it 3 times at each epoch. A somewhat heavy process, but of course it works.

In my original query at the top of this thread, I was assuming that the translation was done inside the learning process. Now I know that wasn’t the case.

I still think that, if translation were included in the learning process, supporting a list of files to translate on the fly would be an interesting feature.

(Vincent Nguyen) #30

Unloading/reloading the model takes only a few seconds compared to the actual translation and training time.

The bash way of doing things is not optimal, but at one point we were also discussing some kind of YAML file to describe a training process/schedule. That may be the solution in the end.

Anyhow, I leave it to Guillaume / Jean to decide, it’s their baby :slight_smile:

NB: having a look at the Moses EMS meta-script could be inspiring.

(jean.senellart) #31

Hi @mrelich, I was thinking about adding more metrics to complete this development - what would be useful for your use case: CER/WER?

(David Landan) #32

Damerau-Levenshtein edit distance divided by the length of the reference might be handy.
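As a sketch of that metric (a minimal Python illustration, not the eventual Lua implementation in OpenNMT; the function names here are made up), the optimal-string-alignment variant of Damerau-Levenshtein, normalized by reference length, could look like:

```python
def dl_edit_distance(ref, hyp):
    # Optimal string alignment variant of Damerau-Levenshtein:
    # insertions, deletions, substitutions, and adjacent transpositions.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if (i > 1 and j > 1
                    and ref[i - 1] == hyp[j - 2]
                    and ref[i - 2] == hyp[j - 1]):
                # Adjacent transposition counts as a single edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]

def normalized_score(ref, hyp):
    # Edit distance divided by reference length, as suggested above.
    if not ref:
        return float(len(hyp) > 0)
    return dl_edit_distance(ref, hyp) / len(ref)
```

For example, `dl_edit_distance("ab", "ba")` is 1 (one transposition) where plain Levenshtein would give 2.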

(Guillaume Klein) #33

Would you consider doing a pull request for that?

I made it very easy to extend; see for example the BLEUEvaluator, which simply implements the logic of scoring a table of predictions against their references.

(David Landan) #34

Sure, I’ll give it a try. I still don’t know Lua very well, and I’m a bit bogged down at work, so it might be 6-8 weeks, but I’d like to do it. :slight_smile:

(jean.senellart) #35

Great! Let us know if you need help; we will be glad to assist, but we also want to train our developer community ;)!

(David Landan) #36

@jean.senellart @guillaumekln - I snuck in some time, so no need to wait 6-8 weeks after all. :smile: Just submitted it. Feedback welcome & encouraged!

(Vincent Nguyen) #37

This is great; I will test it.

By the way, do you know a “good” bash/perl/py script that does the same thing offline from 2 text files?


(David Landan) #38

Eh… I saw this Cython implementation of DL edit distance, which should be pretty fast, though I haven’t used it myself. We have proprietary code for computing several metrics in parallel; unfortunately, nothing I can share.
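For an offline comparison of two plain-text files, a small self-contained Python sketch might look like the following (illustrative only; it uses plain Levenshtein rather than the Cython DL implementation mentioned above, and all names are made up):

```python
import sys

def levenshtein(a, b):
    # Minimal iterative Levenshtein distance with two rolling rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def corpus_score(ref_path, hyp_path):
    # Average per-line edit distance, normalized by reference length.
    total, count = 0.0, 0
    with open(ref_path) as rf, open(hyp_path) as hf:
        for ref, hyp in zip(rf, hf):
            ref, hyp = ref.rstrip("\n"), hyp.rstrip("\n")
            total += levenshtein(ref, hyp) / max(len(ref), 1)
            count += 1
    return total / max(count, 1)

if __name__ == "__main__" and len(sys.argv) == 3:
    print(corpus_score(sys.argv[1], sys.argv[2]))
```

Invoked as `python score.py reference.txt hypothesis.txt`, it prints a single averaged score (0 means identical files).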

(Vincent Nguyen) #39

Another question: I think you did this on a per-character basis.

Does it make sense to do it at the word level as well?

(David Landan) #40

At the character level, it’s edit distance; at the word level, it’s WER (the same algorithm). So yes, it’s certainly a valid metric. We tend to prefer the character level because we believe it better reflects post-editing effort, but in my opinion there are advantages and disadvantages to both approaches.
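The point that it is the same algorithm at both granularities can be shown in a short Python sketch (hedged illustration: `cer`/`wer` are names I chose, and this uses plain Levenshtein without transpositions):

```python
def edit_distance(ref, hyp):
    # Plain Levenshtein over any sequence: a string gives character-level
    # distance, a list of tokens gives word-level distance.
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n]

def cer(ref, hyp):
    # Character error rate: operate on the raw strings.
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    # Word error rate: same function, applied to token lists.
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)
```

For instance, comparing "the cat sat" with "the cat sta" gives a WER of 1/3 (one substituted word) but a CER of only 2/11, illustrating how the two levels weight the same error differently.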

I’ve come up with a hybrid approach, but haven’t had the time to properly implement, test, and (very optimistically) publish it yet. :wink: