Training won't resume from latest checkpoint

Hi,

I’ve been using OpenNMT via Google Colab. Normally, if the notebook runtime expires, I can simply restart the notebook and continue training from the last saved checkpoint. However, in one notebook, the model won’t resume from the last available checkpoint; it restarts training from the beginning or from the first saved checkpoint. Does anyone know why this might happen and how I can prevent it? Happy to provide more details if needed.

Thanks!

Hi,

Yes, more details are needed. Can you post your command line (or code) and the training configuration?

If it helps, I noticed that, depending on what I change in my YAML file, it sometimes triggers a retrain rather than continuing to train my current model…

Thanks for the reply!

This is the line to run the model

!onmt-main train_and_eval --model_type Transformer --auto_config --config weight_config_1_40000.yml --num_gpus 1

This is what was in the yaml file:

train:
  save_checkpoints_steps: 5000
  train_steps: 250000

eval:
  eval_delay: 3600 # Every 1 hour
  external_evaluators: BLEU
  early stopping:
    metric: BLEU
    min_improvement: 0.1
    steps: 4

infer:
  batch_size: 32

Sounds interesting, like what?

I’m having problems copying and pasting the TensorFlow warnings/output I get when I start training the model, as it says ‘new users can only post 2 links in a post’.

Ah you are still using OpenNMT-tf 1.x. I suggest upgrading to a more recent version if possible.

If you are facing the issue in one notebook and not others, this is probably an issue with Google Colab. Maybe the checkpoints are not correctly saved, or they are not visible in the next session.
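One way to check whether the checkpoints are visible in the new session is to read the plain-text `checkpoint` state file that TensorFlow writes next to the checkpoint files. The sketch below is only a minimal illustration using the standard library: the helper name and the example path are hypothetical, and it assumes the model directory contains the usual state file with a `model_checkpoint_path` entry.

```python
import re
from pathlib import Path


def latest_checkpoint_from_state(model_dir):
    """Return the checkpoint prefix recorded in the `checkpoint` state file, or None."""
    state_file = Path(model_dir) / "checkpoint"
    if not state_file.is_file():
        # No state file visible: training would have nothing to resume from.
        return None
    # The entry of interest looks like:  model_checkpoint_path: "ckpt-10000"
    match = re.search(
        r'^model_checkpoint_path:\s*"([^"]+)"',
        state_file.read_text(),
        re.MULTILINE,
    )
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical path: replace it with the model directory used for training.
    model_dir = "/content/gdrive/My Drive/run"
    print("Recorded latest checkpoint:", latest_checkpoint_from_state(model_dir))
```

If this prints `None` or an older checkpoint than expected right after reconnecting, the files were probably not saved or are not visible in that session.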

I promoted your account so you should now be able to post that, if needed.

Thanks, how can I make sure I’m using the most recent version? This is how I install opennmt:

pip install OpenNMT-tf[tensorflow_gpu]==1.*

You could create a new Python environment and follow the installation instructions again: Quickstart — OpenNMT-tf 2.20.0 documentation. However, there is certainly additional work to do since you would upgrade from V1 to V2: 2.0 Transition Guide — OpenNMT-tf 2.20.0 documentation

That being said, I don’t think the issue is related to OpenNMT-tf. If you think it is, it would be helpful to post the steps to reproduce outside of Google Colab.

Thanks for the help. I’m also worried it may be a Colab issue, but I’m not sure exactly what it would be or how to fix it. I’ve now started using version 2.2; hopefully that will make a difference, but I’m unsure. I’ve also implemented some of the advice in this link, which is meant to stop the notebook from timing out, so that could help too.

Hello Albert,

Are you saving your model on your Google Drive? If not, that is most likely your issue. I’m just saying this in case you don’t know, but the Colab environment is completely removed every time your session expires. Your model and vocab need to be saved directly to a path that points to your Google Drive in order to keep them and reuse them (to continue training).

Here is the code to connect your Google Drive.

# Connect Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Hi Samuel,

Yes, I currently do that. That’s the code I use to mount my drive. Perhaps the model carries on training even after I’ve lost the connection to my Google Drive, which would explain why training can’t resume properly? Is that possible?

Albert

No, Google Colab has something that takes care of that… Are you sure your Google Drive is not full? Don’t forget to empty your trash in Google Drive.

Make sure you have lots of room too. If you are keeping lots of checkpoints, they can take up many GB.
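A quick way to check the available room from inside the notebook is Python's `shutil.disk_usage`. This is only a sketch: the helper names are hypothetical, and it assumes the figure reported for the mounted path reflects your Drive quota, which is worth confirming against the storage page in Google Drive itself.

```python
import os
import shutil


def free_gib(path):
    """Return the free space, in GiB, of the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path, needed_gib):
    """Return True if at least `needed_gib` GiB are free at `path`."""
    return free_gib(path) >= needed_gib


if __name__ == "__main__":
    mount_point = "/content/gdrive"  # mount point used in the snippet above
    if os.path.exists(mount_point):
        print(f"Free space at {mount_point}: {free_gib(mount_point):.1f} GiB")
    else:
        print(f"{mount_point} is not mounted in this session")
```

Running this before launching training and again after a few checkpoints gives a rough idea of how quickly the space is being used.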

I’m experiencing a similar issue on Google Colab.

For some reason OpenNMT is saving the checkpoint index files, but not the data files.

E.g., I see “ckpt-10000.index”, but not “ckpt-10000.data-00000-of-00001”.

For some earlier checkpoints, both the index and data files are saved, but past a certain point it only saves the index files.

I have no idea why that is - there’s space on Google Drive, and I’m saving to a directory whose data persists between Colab sessions.

Looking through the whole thing some more, it seems to be a Google Drive issue. Resetting and restarting the runtime appears to help with it.
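To spot the symptom described above (an index file with no matching data file), the model directory can be scanned for incomplete checkpoints. The following is a minimal sketch using only the standard library; the function name is hypothetical, and it assumes the file naming shown above (`<prefix>.index` and `<prefix>.data-00000-of-00001`).

```python
from pathlib import Path


def incomplete_checkpoints(model_dir):
    """Return checkpoint prefixes that have an .index file but no .data-* file."""
    model_dir = Path(model_dir)
    missing = []
    for index_file in sorted(model_dir.glob("*.index")):
        prefix = index_file.stem  # "ckpt-10000.index" -> "ckpt-10000"
        # Data files are named like "<prefix>.data-00000-of-00001".
        if not any(model_dir.glob(prefix + ".data-*")):
            missing.append(prefix)
    return missing


if __name__ == "__main__":
    # Hypothetical path: replace it with the model directory used for training.
    print(incomplete_checkpoints("/content/gdrive/My Drive/run"))
```

An empty list means every index file has at least one data file next to it; any prefix that is listed is a checkpoint that cannot be restored as it stands.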
