Decreasing run time

Hello Everyone,

I switched from Google Colab to Kaggle; they provide a Tesla P100 GPU and 35 hours of free runtime per week. It is approximately 60% faster (from my observation, of course: ~97 secs per 100 steps on Kaggle vs. ~158 secs per 100 steps on Google Colab) when using OpenNMT-py with the WMT14 script here

But session runtime is still a problem. A session only allows 9 hours of uninterrupted running, and it is not possible to reach 100K steps of the above config in that time. 30,200 steps is the maximum I could reach yesterday.
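(Rough arithmetic from the numbers above: 9 hours is 32,400 seconds, and at ~97 secs per 100 steps that is about 33,000 steps per session, which matches what I actually got. So reaching 100K steps would take at least three sessions even at Kaggle speed.)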

Is there any way to speed up training by changing some configuration options like queue_size, bucket_size, batch_size, accum_count, etc.? The P100 seems to be using 11 GB out of 16 GB of memory while training.

Here is the batch section of the configuration; if another section would also be helpful, please let me know.

queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]
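For example, one change I have considered (but not tested) is doing fewer gradient accumulation passes per update by raising batch_size and lowering accum_count, so the effective batch stays at 4096 × 3 = 12288 tokens while more of the free GPU memory gets used:

batch_size: 6144
accum_count: [2]

I don't know yet whether this actually fits in memory or trains noticeably faster, so any advice on these values is welcome.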

Thank you


Change your save_checkpoint_steps to something smaller, and then continue with train_from after the nine hours are up.
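For example, something like this (adjust the step interval, and the checkpoint path to wherever your save_model prefix writes it):

save_checkpoint_steps: 5000
keep_checkpoint: 5

Then, in the next session, point train_from at the latest checkpoint, e.g.:

train_from: "model_step_30000.pt"

As far as I know, the checkpoint keeps the optimizer state, so training should pick up roughly where it left off (see also the reset_optim option if you want to change that behaviour).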

Yes, I thought of that too, but I asked about it a couple of days ago and was told it would not be the same:

Did you try this on your side? Did it make much of a difference, running in two parts versus one full run? I am curious about your observations :slight_smile: