I’ve been using OpenNMT via Google Colab. Normally, if the runtime on the notebook expires, it’s easy for me to restart the notebook and continue training models from the last saved checkpoint. However, in one notebook, the model won’t resume training from the last available checkpoint, and it restarts training from the beginning or from the first saved checkpoint. Does anyone know why this may happen and how I can prevent it? Happy to provide more details if they’re needed.
If it can help, I noticed that depending on what I changed in my yaml file, it sometimes triggers a retrain rather than continuing training of my current model…
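For what it’s worth, OpenNMT-tf resumes from whatever checkpoints it finds in the configured `model_dir`; if that path changes between sessions (or points at an empty folder), training starts from scratch. A minimal sketch of the relevant part of the run configuration, with placeholder paths on a mounted Drive:

```yaml
# Hypothetical OpenNMT-tf run configuration -- all paths here are placeholders.
# Keep model_dir stable across Colab sessions so training resumes from the
# latest checkpoint saved there.
model_dir: /content/gdrive/My Drive/opennmt_run

data:
  train_features_file: /content/gdrive/My Drive/data/src-train.txt
  train_labels_file: /content/gdrive/My Drive/data/tgt-train.txt
```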
I’m having problems copying and pasting the TensorFlow warnings/output etc. that I get when I start training the model, as it says ‘new users can only post 2 links in a post’.
Ah you are still using OpenNMT-tf 1.x. I suggest upgrading to a more recent version if possible.
If you are facing the issue in one notebook and not others, this is probably an issue with Google Colab. Maybe the checkpoints are not correctly saved, or they are not visible in the next session.
I promoted your account so you should now be able to post that, if needed.
That being said, I don’t think the issue is related to OpenNMT-tf. If you think it is, it would be helpful to post the steps to reproduce outside of Google Colab.
Thanks for the help. I’m also worried it may be a Colab issue, but I’m not sure exactly what it would be or how to fix it. I’ve now started using version 2.2; hopefully that will make a difference, but I’m unsure. I’ve also implemented some of the advice in this link, which is meant to stop the notebook from timing out, so that could help too.
Are you saving your model on your Google Drive? If not, that is most likely your issue. I’m just saying in case you don’t know, but the Colab environment is completely wiped every time your session expires. Your model/vocab need to be saved to a path that points to your Google Drive in order to keep them and reuse them (continue training).
Here is the code to connect your Google Drive:
#Connect Google Drive
from google.colab import drive
drive.mount('/content/gdrive')
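Once the Drive is mounted, it can also help to sanity-check that the checkpoints from the previous session are actually visible before restarting training. Here is a small sketch (the `model_dir` path is a placeholder, and this helper function is my own, not part of OpenNMT):

```python
# Sketch: check which checkpoint would be resumed from, before restarting
# training. TensorFlow writes one <prefix>.index file per checkpoint, so the
# newest .index file tells us the latest checkpoint prefix.
import glob
import os

def latest_checkpoint_prefix(model_dir):
    """Return the newest TensorFlow checkpoint prefix in model_dir, or None."""
    index_files = glob.glob(os.path.join(model_dir, "*.index"))
    if not index_files:
        return None
    newest = max(index_files, key=os.path.getmtime)
    return newest[: -len(".index")]  # strip the .index suffix

# Placeholder path -- point this at the model_dir from your run config.
model_dir = "/content/gdrive/My Drive/opennmt_run"
print(latest_checkpoint_prefix(model_dir))  # prints None if nothing is saved there
```

If this prints `None` after mounting the Drive, the previous session never wrote its checkpoints to the Drive, which would explain training restarting from scratch.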
Yes, I currently do that. That’s the code I use to mount my drive. Perhaps the model carries on training even after I’ve lost the connection to my Google Drive, which is why training can’t resume properly? Is that possible?