OpenNMT-py model stops making training progress while trying to validate

I’m training an OpenNMT-py model and got to step 5000, but once validation runs it stops making progress. nvidia-smi shows GPU usage at 99% and the training process is still running. I can’t tell if something has gone wrong, if validation is just taking a long time, or if my valid_steps setting (5000) is too high. Training has been stuck at step 5000 for over 8 hours on an Nvidia K80 GPU. Is there a recommendation for how to set valid_steps, and should there be any output while validation is running?

# config.yml
save_checkpoint_steps: 1000
valid_steps: 5000
train_steps: 50000
[2021-02-24 02:13:23,304 INFO] Step 4800/50000; acc:  38.57; ppl: 26.90; xent: 3.29; lr: 0.00059; 2724/1950 tok/s;  26226 sec
[2021-02-24 02:17:57,759 INFO] Step 4850/50000; acc:  25.25; ppl: 90.83; xent: 4.51; lr: 0.00060; 2390/1925 tok/s;  26501 sec
[2021-02-24 02:22:33,240 INFO] Step 4900/50000; acc:  25.38; ppl: 80.47; xent: 4.39; lr: 0.00061; 2370/1891 tok/s;  26776 sec
[2021-02-24 02:27:04,919 INFO] Step 4950/50000; acc:  33.67; ppl: 39.46; xent: 3.68; lr: 0.00061; 2601/1991 tok/s;  27048 sec
[2021-02-24 02:31:34,405 INFO] Step 5000/50000; acc:  34.04; ppl: 38.78; xent: 3.66; lr: 0.00062; 2649/1985 tok/s;  27317 sec
[2021-02-24 02:31:34,407 INFO] valid's transforms: TransformPipe(SentencePieceTransform(share_vocab=True, src_subword_model=sentencepiece.model, tgt_subword_model=sentencepiece.model, src_subword_alpha=0.0, tgt_subword_alpha=0.0, src_subword_vocab=, tgt_subword_vocab=, src_vocab_threshold=0, tgt_vocab_threshold=0, src_subword_nbest=1, tgt_subword_nbest=1), FilterTooLongTransform(src_seq_length=150, tgt_seq_length=150))
[2021-02-24 02:31:34,408 INFO] Loading ParallelCorpus(split_data/src-val.txt, split_data/tgt-val.txt, align=None)...

Thanks for any help.

Full code

What is the size of your valid set?
Note: valid_steps means that validation will happen every valid_steps, not that validation takes valid_steps steps.
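
To illustrate, with your config (train_steps: 50000, valid_steps: 5000) validation will kick in at steps 5000, 10000, ..., 50000. Roughly, the schedule looks like this (just an illustration in plain Python, not OpenNMT-py’s actual training loop):

def train_one_step(step):
    pass  # placeholder for one optimizer update on a training batch

def run_validation():
    pass  # placeholder for a full pass over the validation set

train_steps = 50000
valid_steps = 5000  # validation happens every 5000 training steps

for step in range(1, train_steps + 1):
    train_one_step(step)
    if step % valid_steps == 0:
        # A full pass over the valid set runs here, so a huge valid set
        # makes training look "stuck" at steps 5000, 10000, ...
        run_validation()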

Note: valid_steps means that validation will happen every valid_steps, not that validation takes valid_steps steps.

Makes sense, thanks.

What is the size of your valid set?

I think this was the problem. I had a bug in my training script where data got added to the valid set 5+ times, making the valid data about 81 million lines.

I’m trying to use 30% of my total data as validation data.

I’m trying to use 30% of my total data as validation data.

Since I don’t know the absolute size of your total data, that’s not very helpful on its own.
But yes, that’s probably the problem; this seems to be a bit too much.

I have ~30,000,000 lines of data and I’m switching to using 10% for validation, so ~3,000,000 lines. I was using 30% before, which would have been ~9,000,000, but because of a bug in my training script the data was added multiple times, so there were ~81,000,000 lines of validation data.
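
For reference, the fixed split is roughly along these lines (a simplified sketch, not my exact script; the input paths are just examples, the output paths match my log). Each sentence pair is written to exactly one split, so nothing can get appended twice:

import os
import random

# Hypothetical input paths -- substitute your own parallel files.
SRC_PATH, TGT_PATH = "data/src.txt", "data/tgt.txt"
VALID_FRACTION = 0.1  # 10% of the corpus goes to validation

random.seed(42)
os.makedirs("split_data", exist_ok=True)
with open(SRC_PATH) as fs, open(TGT_PATH) as ft, \
     open("split_data/src-train.txt", "w") as out_src_train, \
     open("split_data/tgt-train.txt", "w") as out_tgt_train, \
     open("split_data/src-val.txt", "w") as out_src_val, \
     open("split_data/tgt-val.txt", "w") as out_tgt_val:
    for src_line, tgt_line in zip(fs, ft):
        # Each pair goes to exactly one split, so the valid set can
        # never grow beyond roughly VALID_FRACTION of the corpus.
        if random.random() < VALID_FRACTION:
            out_src_val.write(src_line)
            out_tgt_val.write(tgt_line)
        else:
            out_src_train.write(src_line)
            out_tgt_train.write(tgt_line)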

3M lines for validation is still quite huge… a few thousand or a few tens of thousands should do the trick.
Anyway, the behaviour you witnessed is ‘normal’ given those amounts of data.
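
If you’ve already written the big valid files, something like this trims them down to a few thousand aligned pairs (input paths taken from your log, the “-small” output names are just placeholders; it takes the first N lines for simplicity, but you could also sample randomly):

from itertools import islice

N = 2000  # a few thousand pairs is plenty for validation

# Trim src and tgt together so the sentence pairs stay aligned.
with open("split_data/src-val.txt") as fs, \
     open("split_data/tgt-val.txt") as ft, \
     open("split_data/src-val-small.txt", "w") as out_src, \
     open("split_data/tgt-val-small.txt", "w") as out_tgt:
    for src_line, tgt_line in islice(zip(fs, ft), N):
        out_src.write(src_line)
        out_tgt.write(tgt_line)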

Thanks for the link, I thought you wanted to use much more data for validation.

My validation step was still taking a very long time, so I switched to using 2000 sentences for validation.

This worked: I switched to 2000 lines of valid data and it’s running fine. Thanks!