How to load the data with torch.utils.data.distributed.DistributedSampler and torch.utils.data.DataLoader in OpenNMT?

rylanchiu · November 14, 2019, 3:43am

I am trying to load data with DistributedSampler and DataLoader in OpenNMT-py so that I can run distributed training on multiple nodes more easily. What would be the way to do this with minimal effort? Any hint is highly appreciated. Thank you in advance!

rylanchiu · November 15, 2019, 1:59am

It seems the easiest hack would be to create a torch.utils.data.Dataset subclass that wraps TextDataset, then pass an instance of it to DistributedSampler and DataLoader.
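
A minimal sketch of that idea, under a few assumptions: the wrapped object is treated only as an indexable collection of token-id lists (a stand-in, not the actual OpenNMT-py TextDataset interface), and `WrappedDataset`, `pad_collate`, `build_loader`, and `PAD_ID` are hypothetical names made up for illustration. The rank and replica count are passed explicitly here so the snippet runs without an initialized process group.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

PAD_ID = 0  # assumed padding index, for illustration only


class WrappedDataset(Dataset):
    """Hypothetical map-style wrapper around an indexable collection of examples."""

    def __init__(self, examples):
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def pad_collate(batch):
    """Pad a list of token-id lists to the longest sequence in the batch."""
    seqs = [torch.tensor(ids, dtype=torch.long) for ids in batch]
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(seqs, batch_first=True, padding_value=PAD_ID)
    return padded, lengths


def build_loader(examples, num_replicas, rank, batch_size=2, shuffle=False):
    """Build a DataLoader that yields only this rank's shard of the data."""
    dataset = WrappedDataset(examples)
    sampler = DistributedSampler(
        dataset, num_replicas=num_replicas, rank=rank, shuffle=shuffle
    )
    return DataLoader(
        dataset, batch_size=batch_size, sampler=sampler, collate_fn=pad_collate
    )
```

In a real job, `num_replicas` and `rank` can be left out of `DistributedSampler` once `torch.distributed.init_process_group` has been called, since the sampler then reads them from the process group. With `shuffle=True`, calling `sampler.set_epoch(epoch)` at the start of each epoch is needed to get a different ordering per epoch.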

francoishernandez · November 15, 2019, 8:59am

Hey @rylanchiu
Did you try the current implementation of distributed training with the producer/consumer setup? We made it so that it spawns one batch producer process per node, which creates batches and puts them in each GPU process's queue.
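
For readers unfamiliar with the pattern, here is a toy sketch of a producer feeding per-consumer queues. It is not the OpenNMT-py code: it uses standard-library threads as stand-ins for the producer and GPU processes, assigns batches round-robin (an assumption), and the names `producer`, `consumer`, and `run` are made up for illustration.

```python
import queue
import threading


def producer(batches, queues):
    """Distribute batches round-robin, then send a stop sentinel to each queue."""
    for i, batch in enumerate(batches):
        queues[i % len(queues)].put(batch)
    for q in queues:
        q.put(None)  # None marks the end of the stream


def consumer(q, received):
    """Stand-in for a GPU worker: drain the queue until the sentinel arrives."""
    while True:
        batch = q.get()
        if batch is None:
            break
        received.append(batch)


def run(batches, num_consumers):
    """Start one producer and num_consumers consumers; return what each received."""
    # Bounded queues keep the producer from running far ahead of the consumers.
    queues = [queue.Queue(maxsize=4) for _ in range(num_consumers)]
    received = [[] for _ in range(num_consumers)]
    workers = [
        threading.Thread(target=consumer, args=(queues[i], received[i]))
        for i in range(num_consumers)
    ]
    for w in workers:
        w.start()
    prod = threading.Thread(target=producer, args=(batches, queues))
    prod.start()
    prod.join()
    for w in workers:
        w.join()
    return received
```

Calling `run(list(range(6)), 2)` hands batches 0, 2, 4 to the first consumer and 1, 3, 5 to the second.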

rylanchiu · November 15, 2019, 11:18am

Hi, I am using a training platform where users have little permission to control the nodes directly. That means I cannot specify a worker myself; workers are allocated by the system. In general, the only things I can choose are the number of nodes and GPUs I want to use.