Hello Everyone,
I switched from Google Colab to Kaggle: they provide a Tesla P100 GPU and 35 hours of free runtime per week. In my observation it is approximately 60% faster (~97 secs per 100 steps on Kaggle vs. ~158 secs per 100 steps on Google Colab) when using OpenNMT-py with the WMT14 script here
But session runtime is still a problem. A single session can only run for 9 hours uninterrupted, so it is not possible to reach 100K steps with this config; 30,200 steps was the maximum I could reach yesterday.
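One possible workaround for the 9-hour limit (a sketch I haven't verified on Kaggle specifically; the paths and step values are example placeholders) is to save checkpoints regularly and resume in the next session with OpenNMT-py's `train_from` option:

```yaml
# Save a checkpoint every 5000 steps so a killed session loses little work.
# save_model prefix and checkpoint filename below are example values.
save_model: model/wmt14
save_checkpoint_steps: 5000
keep_checkpoint: 3

# In the next session, point train_from at the last saved checkpoint:
train_from: model/wmt14_step_30000.pt
```

With three or four resumed sessions this should let a run pass 100K steps despite the per-session cap.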
Is there any way to speed up training by changing configuration options such as queue_size, bucket_size, batch_size, accum_count, etc.? The P100 seems to be using 11GB of its 16GB of memory while training.
Here is the batch section of my configuration; if another section would also be helpful, please let me know.
```yaml
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]
```
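As a quick sanity check (plain Python arithmetic, not OpenNMT-py code), the effective tokens per optimizer step and the total wall-clock time needed for 100K steps can be estimated from the numbers above:

```python
import math

# Values taken from the config and timings in the post.
tokens_per_batch = 4096   # batch_size with batch_type: "tokens"
accum_count = 3           # gradient accumulation count
world_size = 1            # single GPU

# Tokens consumed per optimizer step (batch * accumulation * GPUs)
effective_tokens = tokens_per_batch * accum_count * world_size
print(effective_tokens)          # 12288

# ~97 s per 100 steps observed on the Kaggle P100
secs_per_step = 97 / 100
total_hours = 100_000 * secs_per_step / 3600
print(round(total_hours, 1))     # 26.9

# Number of 9-hour sessions needed to cover that
sessions = math.ceil(total_hours / 9)
print(sessions)                  # 3
```

So at the current speed, 100K steps is roughly 27 GPU-hours, i.e. at least three 9-hour sessions even before counting restart overhead.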
Thank you