How do I set train_steps for the single- and multi-GPU cases? Is it documented anywhere?
Search this forum for "step epoch".
I have searched, but I couldn't find anything clear. I know what train_steps is, but I don't know how to work it out for the multi-GPU case. Since we have to set it to a very large number (in my case I have more than 24 million parallel sentences), I did the following computation:
I have 24339550 sentences (the total over all shards) with a batch size of 8, so for 100 epochs I have to set train_steps to (24339550 / 8) * 100 = 304244375. But I am not sure whether this is also valid for multiple GPUs (I work with at least 2). I assume that with two GPUs, for the same number of epochs, I have to divide by 2 (304244375 / 2), so I set train_steps to 152122187 for 2 GPUs. I hope I am right!
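The arithmetic above can be sketched as a small helper. The function name is mine, and the assumption that each optimizer step consumes one distinct batch per GPU (so steps for a fixed number of epochs shrink by the GPU count) is the one made in the post, not something guaranteed by OpenNMT-py:

```python
def train_steps_for_epochs(num_sentences, batch_size, epochs, num_gpus=1):
    """Estimate train_steps when batch_size counts sentences.

    Assumes each step feeds one batch to each GPU, so the steps
    needed for the same number of epochs divide by num_gpus.
    """
    steps_per_epoch = num_sentences / batch_size  # batches in one pass over the data
    return int(steps_per_epoch * epochs / num_gpus)

# The numbers from the post:
print(train_steps_for_epochs(24_339_550, 8, 100))     # 1 GPU -> 304244375
print(train_steps_for_epochs(24_339_550, 8, 100, 2))  # 2 GPUs -> 152122187
```

Note this only holds if batch_size really counts sentences; with token-based batching (see the FAQ reply below in the thread) the count per batch varies.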
If your task is translation, then look at this: https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/FAQ.md
Check the Transformer configuration and try to understand the token-based batching system.
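To illustrate why token batching changes the calculation: with batch_type "tokens", batch_size counts (roughly) tokens per batch rather than sentences, and one optimizer step consumes accum_count * num_gpus batches. The corpus token count and config values below are made-up illustrative numbers, not from this thread:

```python
def steps_per_epoch_tokens(total_tokens, batch_size_tokens,
                           accum_count=1, num_gpus=1):
    """Rough steps per epoch under token-based batching.

    One optimizer step consumes accum_count * num_gpus batches,
    each holding about batch_size_tokens tokens.
    """
    tokens_per_step = batch_size_tokens * accum_count * num_gpus
    return total_tokens // tokens_per_step

# e.g. a hypothetical ~600M-token corpus with the FAQ's Transformer
# setting of 4096 tokens per batch, accum_count 2, on 2 GPUs:
print(steps_per_epoch_tokens(600_000_000, 4096, 2, 2))
```

So with token batching you estimate epochs from tokens, not from a fixed sentences-per-batch count.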
Multi-GPU just means one process per GPU: the producer sends a batch to each GPU in turn, so each optimizer step processes as many batches as there are GPUs.
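A toy model of that dispatch, assuming simple round-robin hand-out (the function is illustrative, not OpenNMT-py's actual producer): batches go to GPU 0, 1, ..., N-1 in turn, so one step consumes num_gpus batches and an epoch finishes in len(batches) / num_gpus steps.

```python
from itertools import islice

def round_robin_dispatch(batches, num_gpus):
    """Yield, per optimizer step, the batches sent to each GPU in turn."""
    it = iter(batches)
    while True:
        step = list(islice(it, num_gpus))  # one batch per GPU
        if not step:
            return
        yield step

batches = [f"batch{i}" for i in range(6)]
for step, per_gpu in enumerate(round_robin_dispatch(batches, 2)):
    print(step, per_gpu)  # 3 steps instead of 6 on a single GPU
```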