OpenNMT

Training won't resume from the latest checkpoint

Hi,

I’ve been using OpenNMT via Google Colab. Normally, if the runtime on the notebook expires, it’s easy for me to restart the notebook and continue training models from the last saved checkpoint. However, in one notebook, the model won’t resume training from the last available checkpoint; it restarts training from the beginning or from the first saved checkpoint. Does anyone know why this may happen and how I can prevent it? Happy to provide more details if they’re needed.

Thanks!

Hi,

Yes, more details are needed. Can you post your command line (or code) and the training configuration?

If it helps, I noticed that depending on what I changed in my YAML file, it would sometimes trigger retraining from scratch rather than continue training my current model…

Thanks for the reply!

This is the line to run the model

!onmt-main train_and_eval --model_type Transformer --auto_config --config weight_config_1_40000.yml --num_gpus 1

This is what was in the yaml file:

train:
  save_checkpoints_steps: 5000
  train_steps: 250000

eval:
  eval_delay: 3600  # Every 1 hour
  external_evaluators: BLEU
  early_stopping:
    metric: BLEU
    min_improvement: 0.1
    steps: 4

infer:
  batch_size: 32

Sounds interesting, like what?

I’m having problems copying and pasting the TensorFlow warnings/output I get when I start training the model, as the forum says ‘new users can only post 2 links in a post’.

Ah you are still using OpenNMT-tf 1.x. I suggest upgrading to a more recent version if possible.

If you are facing the issue in one notebook and not others, this is probably an issue with Google Colab. Maybe the checkpoints are not correctly saved, or they are not visible in the next session.
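One quick way to check whether the checkpoints are visible in the new session: TensorFlow records the checkpoint it would resume from in a plain-text file named `checkpoint` inside the model directory. This is an illustrative sketch using only the standard library; the function name and the model directory path are my own, not part of OpenNMT-tf:

```python
import os
import re

def latest_checkpoint_name(model_dir):
    """Read TensorFlow's plain-text 'checkpoint' index file and return
    the checkpoint name training would resume from, or None if absent."""
    index_path = os.path.join(model_dir, "checkpoint")
    if not os.path.exists(index_path):
        # No index file means training will restart from step 0.
        return None
    with open(index_path) as f:
        for line in f:
            match = re.match(r'model_checkpoint_path:\s*"(.+)"', line)
            if match:
                return match.group(1)
    return None
```

If this returns None at the start of a new session, or returns a path pointing into the wiped `/content` filesystem instead of your mounted Drive, that would explain training restarting from scratch.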

I promoted your account so you should now be able to post that, if needed.

Thanks, how can I make sure I’m using the most recent version? This is how I install opennmt:

pip install OpenNMT-tf[tensorflow_gpu]==1.*

You could create a new Python environment and follow the installation instructions again: Quickstart — OpenNMT-tf 2.20.0 documentation. However, there is certainly additional work to do since you would be upgrading from V1 to V2: 2.0 Transition Guide — OpenNMT-tf 2.20.0 documentation

That being said, I don’t think the issue is related to OpenNMT-tf. If you think it is, it would be helpful to post the steps to reproduce outside of Google Colab.

Thanks for the help. I’m also worried it may be a Colab issue, but I’m not sure exactly what it would be or how to fix it. I’ve now started using version 2.2; hopefully that will make a difference, but I’m unsure. I’ve also implemented some of the advice in this link, which is meant to stop the notebook from timing out; that could help too.

Hello Albert,

Are you saving your model on your Google Drive? If not, that is most likely your issue. I’m just saying in case you don’t know, but the Colab environment is completely wiped every time your session expires. Your model/vocab need to be saved to a path that points to your Google Drive in order to keep them and reuse them (continue training).

here is the code to connect your Google Drive.

#Connect Google Drive

from google.colab import drive

drive.mount('/content/gdrive')
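Once the Drive is mounted, the checkpoints also need to land on it. OpenNMT-tf reads the checkpoint location from the `model_dir` entry at the top level of the YAML configuration; a minimal sketch, where the exact Drive path is just an example to adapt to your own folder layout:

```yaml
# Point the model directory at the mounted Drive so checkpoints
# survive the Colab session (example path, adjust to your folder).
model_dir: /content/gdrive/My Drive/opennmt/run1
```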

Hi Samuel,

Yes, I currently do that. That’s the code I use to mount my drive. Perhaps the model carries on training even if I’ve lost the connection to my Google Drive, which is why the training can’t resume properly? Is that possible?

Albert

No, Google Colab has something that takes care of that… Are you sure your Google Drive is not full? Don’t forget to empty your trash in Google Drive.

Make sure you have plenty of room too; if you are keeping lots of checkpoints, they can take up many GB.
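One way to keep an eye on this is to total the size of the checkpoint files in the model directory. A stdlib sketch; the function name is my own, and it assumes the TensorFlow default naming of checkpoint files as `ckpt-*`:

```python
import os

def checkpoint_usage_bytes(model_dir):
    """Sum the sizes of TensorFlow checkpoint files (ckpt-*.index,
    ckpt-*.data-*) found directly under model_dir."""
    total = 0
    for name in os.listdir(model_dir):
        if name.startswith("ckpt-"):
            total += os.path.getsize(os.path.join(model_dir, name))
    return total
```

OpenNMT-tf also has a train option to limit how many checkpoints are kept (`keep_checkpoint_max` in recent versions), which keeps Drive usage bounded without manual cleanup.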