How do I set train_steps for the single- and multi-GPU cases? Is it documented anywhere?
Search this forum for "step epoch".
I have searched, but I couldn't find anything clear. I know what train_steps is, but I don't know how to work it out for the multi-GPU case. Since we have to set it to a very large number (in my case I have more than 24 million parallel sentences), I did the following computation:
I have 24339550 sentences (the total over all shards) with a batch size of 8, so for 100 epochs I have to set train_steps to (24339550 / 8) * 100 = 304244375. But I am not sure whether this is also valid for multiple GPUs (I work with at least 2). I assume that with two GPUs, for the same number of epochs, I have to divide by 2 (304244375 / 2), so I set train_steps to 152122187 for 2 GPUs. I hope I am right!
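The arithmetic above can be sketched as a small helper. The function name is mine, and the assumption that each optimizer step consumes one distinct batch per GPU (so steps for a fixed number of epochs shrink by the GPU count) is the one made in the post, not something guaranteed by OpenNMT-py:

```python
def train_steps_for_epochs(num_sentences, batch_size, epochs, num_gpus=1):
    """Estimate train_steps when batch_size counts sentences.

    Assumes each step feeds one batch to each GPU, so the steps
    needed for the same number of epochs divide by num_gpus.
    """
    steps_per_epoch = num_sentences / batch_size  # batches in one pass over the data
    return int(steps_per_epoch * epochs / num_gpus)

# The numbers from the post:
print(train_steps_for_epochs(24_339_550, 8, 100))     # 1 GPU -> 304244375
print(train_steps_for_epochs(24_339_550, 8, 100, 2))  # 2 GPUs -> 152122187
```

Note this only holds if batch_size really counts sentences; with token-based batching (see the FAQ reply below in the thread) the count per batch varies.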
If your task is translation, then look at this: https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/FAQ.md
Check the Transformer configuration and try to understand the token-based batching system.
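To illustrate why token batching changes the calculation: with batch_type "tokens", batch_size counts (roughly) tokens per batch rather than sentences, and one optimizer step consumes accum_count * num_gpus batches. The corpus token count and config values below are made-up illustrative numbers, not from this thread:

```python
def steps_per_epoch_tokens(total_tokens, batch_size_tokens,
                           accum_count=1, num_gpus=1):
    """Rough steps per epoch under token-based batching.

    One optimizer step consumes accum_count * num_gpus batches,
    each holding about batch_size_tokens tokens.
    """
    tokens_per_step = batch_size_tokens * accum_count * num_gpus
    return total_tokens // tokens_per_step

# e.g. a hypothetical ~600M-token corpus with the FAQ's Transformer
# setting of 4096 tokens per batch, accum_count 2, on 2 GPUs:
print(steps_per_epoch_tokens(600_000_000, 4096, 2, 2))
```

So with token batching you estimate epochs from tokens, not from a fixed sentences-per-batch count.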
Multi-GPU just means one process per GPU: the producer sends a batch to each GPU in turn, so each optimizer step processes as many batches as there are GPUs.
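A toy model of that dispatch, assuming simple round-robin hand-out (the function is illustrative, not OpenNMT-py's actual producer): batches go to GPU 0, 1, ..., N-1 in turn, so one step consumes num_gpus batches and an epoch finishes in len(batches) / num_gpus steps.

```python
from itertools import islice

def round_robin_dispatch(batches, num_gpus):
    """Yield, per optimizer step, the batches sent to each GPU in turn."""
    it = iter(batches)
    while True:
        step = list(islice(it, num_gpus))  # one batch per GPU
        if not step:
            return
        yield step

batches = [f"batch{i}" for i in range(6)]
for step, per_gpu in enumerate(round_robin_dispatch(batches, 2)):
    print(step, per_gpu)  # 3 steps instead of 6 on a single GPU
```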