Hi,
I have 4 datasets that vary in size. One is tiny, one is medium, one is large, and one is huge (relatively). My shard size is about 250000, so the datasets are split into tiny.0, medium.0, large.0, huge.0, huge.1, huge.2, huge.3, huge.4.
During training, I've noticed two things:
- Huge shard 0 has been loaded only once, while the tiny.0/medium.0/large.0 shards are reloaded every 5 minutes.
- Huge shards 1/2/3/4 have never been loaded.
I’m currently at step 170000/200000. I want to understand whether I’m doing something wrong with the shards: why aren’t the other huge shards being loaded, and why are the smaller shards reloaded constantly?
Thanks
What’s your batch_size and what are your data weights? It’s possible that you haven’t reached the end of your huge.0 shard yet.
Batch size is 2048, my data weights are 2 2 1 1, and I’m training on 4 GPUs. The reason I set the weights like that is that I wanted to give the smaller datasets more training time, but I think I’ve misunderstood how the data weights work.
2 2 1 1 means that, to constitute a batch, 2 examples will be taken from corpus 1, then 2 examples from corpus 2, then 1 example from corpus 3, then 1 example from corpus 4, and we loop again, 2 examples from corpus 1, etc.
This means that your huge dataset actually accounts for only 1/6th of the examples seen during training. That may explain why huge.1 has not been reached yet.
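Here’s a minimal Python sketch of that weighted round-robin behaviour (not the actual implementation, just an illustration with made-up generator names):

```python
from itertools import cycle, islice

def weighted_mix(corpora, weights):
    """Cycle over the corpora, taking `weights[i]` examples from corpus i
    on each pass, in order."""
    iters = [iter(c) for c in corpora]
    for i in cycle(range(len(iters))):
        for _ in range(weights[i]):
            yield next(iters[i])

# Toy stand-ins for the four corpora (infinite streams of labelled examples).
corpora = [
    (f"tiny_{n}" for n in range(10**9)),
    (f"medium_{n}" for n in range(10**9)),
    (f"large_{n}" for n in range(10**9)),
    (f"huge_{n}" for n in range(10**9)),
]

# First 12 examples of the mixed stream with weights 2 2 1 1:
print(list(islice(weighted_mix(corpora, [2, 2, 1, 1]), 12)))
# ['tiny_0', 'tiny_1', 'medium_0', 'medium_1', 'large_0', 'huge_0',
#  'tiny_2', 'tiny_3', 'medium_2', 'medium_3', 'large_1', 'huge_1']
```

So out of every 6 examples in the stream, only 1 comes from the huge corpus, regardless of how many examples that corpus actually contains.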
“To constitute a batch” changes my understanding of the weights and makes much more sense to me! I realize now I need to find a better distribution of the weights. I used to believe that a weight of 2 would double the dataset, so that 1000 examples would become 2000. Now I see that it doesn’t duplicate the data overall; the weight applies per batch, so a small dataset with a higher weight just gets cycled through (and repeated) much faster than the larger ones. Thanks!
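Just to check my understanding, a quick back-of-envelope sketch (the corpus-to-weight mapping here is my assumption, in the order tiny/medium/large/huge):

```python
# Fraction of the mixed training stream each corpus receives with weights 2 2 1 1.
weights = {"tiny": 2, "medium": 2, "large": 1, "huge": 1}
total = sum(weights.values())  # 6
for name, w in weights.items():
    print(f"{name}: {w}/{total} = {w / total:.3f} of examples seen")
# The huge corpus gets only 1/6 of the examples even though it holds by far
# the most data, which explains why huge.1-huge.4 can stay untouched so long.
```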