I’m training an OpenNMT-py model and got to 5,000 steps, but once validation runs it stops making progress. nvidia-smi shows GPU usage at 99% and the training process is still running. I can’t tell if something has gone wrong, whether validation is just taking a long time, or whether my valid_steps setting (5000) is off. I’ve been stuck at step 5,000 for > 8 hrs on an NVIDIA K80 GPU. Is there a recommendation for valid_steps, and should there be any output while validation is running?
What is the size of your valid set ?
Note: valid_steps means that validation will happen every valid_steps, not that validation takes valid_steps steps.
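For reference, valid_steps sits alongside train_steps in the training config. A minimal sketch (assuming OpenNMT-py 2.x YAML option names; the paths are placeholders):

```yaml
# Train for 100000 steps, running validation every 5000 steps.
train_steps: 100000
valid_steps: 5000
save_checkpoint_steps: 5000

data:
    corpus_1:
        path_src: data/train.src
        path_tgt: data/train.tgt
    valid:
        path_src: data/valid.src   # keep this small (a few thousand lines)
        path_tgt: data/valid.tgt
```

With this config, a long pause at step 5000 is simply the validation pass over the valid files.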
Makes sense thanks.
I think this was the problem. I had a bug in my training script where data got added 5+ times to the valid data making the valid data 81 million lines.
I’m trying to use 30% of my total data as validation data.
I have ~30,000,000 lines of data and I’m switching to 10% for validation, so ~3,000,000 lines. I was using 30% before, which would have been ~9,000,000, but because of a bug in my training script the data was added multiple times, so there were ~81,000,000 lines of validation data.
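If the duplication bug left repeated pairs in the valid files, one quick cleanup is to deduplicate the parallel src/tgt data jointly, so the two files stay line-aligned (a sketch with a hypothetical helper name and inline example data):

```python
def dedup_parallel(src_lines, tgt_lines):
    """Drop repeated (src, tgt) pairs, keeping the first occurrence and alignment."""
    seen = set()
    out_src, out_tgt = [], []
    for s, t in zip(src_lines, tgt_lines):
        if (s, t) not in seen:
            seen.add((s, t))
            out_src.append(s)
            out_tgt.append(t)
    return out_src, out_tgt

src = ["hello", "world", "hello", "hello"]
tgt = ["bonjour", "monde", "bonjour", "salut"]
s2, t2 = dedup_parallel(src, tgt)
# The repeated ("hello", "bonjour") pair is dropped; ("hello", "salut") is a
# different pair, so it stays.
```

Deduplicating on the pair, rather than on src alone, avoids throwing away legitimate alternative translations of the same source line.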
3M lines for validation is still quite huge… a few thousand or tens of thousands of lines should do the trick.
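One way to carve off a validation set of that size is to sample line indices jointly, so src and tgt stay aligned (a sketch; the function name, the 2000-line default, and the seed are all illustrative):

```python
import random

def split_valid(src_lines, tgt_lines, n_valid=2000, seed=13):
    """Randomly hold out n_valid aligned pairs for validation; the rest is training."""
    assert len(src_lines) == len(tgt_lines)
    rng = random.Random(seed)
    idx = list(range(len(src_lines)))
    rng.shuffle(idx)
    valid_idx = set(idx[:n_valid])
    pairs = list(zip(src_lines, tgt_lines))
    train = [p for i, p in enumerate(pairs) if i not in valid_idx]
    valid = [p for i, p in enumerate(pairs) if i in valid_idx]
    return train, valid
```

Sampling indices (rather than splitting each file independently) guarantees the held-out src and tgt lines remain paired.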
Anyway, the behaviour you witnessed is ‘normal’ given those amounts of data.
Thanks for the link, I thought you wanted to use much more data for validation. Validation was still taking a very long time, so I switched to 2,000 sentences.