I am now trying to load data with DistributedSampler and DataLoader in OpenNMT-py, so that I can do distributed training on multiple nodes more easily. What would be the way to do this with minimal effort? Any hint is highly appreciated! Thank you in advance!
It seems the easiest hack would be to create an instance of torch.utils.data.Dataset as a wrapper around TextDataset and then pass it to DistributedSampler and DataLoader, roughly as in the sketch below.
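A minimal sketch of that wrapper idea (names like `WrappedTextDataset` and `build_loader` are my own, and it assumes `torch.distributed.init_process_group` has already been called in each process and that the TextDataset examples are indexable):

```python
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler


class WrappedTextDataset(Dataset):
    """Thin map-style wrapper around a list-like collection of examples."""

    def __init__(self, examples):
        self.examples = list(examples)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


def build_loader(examples, batch_size=32):
    # DistributedSampler reads the world size / rank from the default
    # process group, so init_process_group must have been called already.
    dataset = WrappedTextDataset(examples)
    sampler = DistributedSampler(dataset)  # shards indices across ranks
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, sampler


# Usage per epoch: call sampler.set_epoch(epoch) so shuffling differs each epoch.
# loader, sampler = build_loader(my_text_dataset.examples)
# for epoch in range(n_epochs):
#     sampler.set_epoch(epoch)
#     for batch in loader:
#         ...
```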
Hey @rylanchiu
Did you try the current implementation of distributed training with the producer/consumer setup? We made it so that it spawns one batch producer process per node, which creates batches and puts them in each GPU process's queue.
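Roughly, the producer/consumer pattern looks something like the sketch below. This is only an illustration of the idea, not the actual OpenNMT-py code; `make_batches`, the queue size, and the training step are placeholders:

```python
import torch.multiprocessing as mp


def make_batches():
    # Placeholder for real batch creation on this node.
    return iter(range(100))


def batch_producer(queues):
    # One producer per node: round-robin batches into the per-GPU queues.
    for i, batch in enumerate(make_batches()):
        queues[i % len(queues)].put(batch)
    for q in queues:
        q.put(None)  # sentinel: no more batches


def trainer(gpu_id, queue):
    # Each GPU process consumes batches from its own queue.
    while True:
        batch = queue.get()
        if batch is None:
            break
        # train_step(batch, gpu_id)  # placeholder for the real training step


if __name__ == "__main__":
    n_gpus = 2
    ctx = mp.get_context("spawn")
    queues = [ctx.Queue(maxsize=40) for _ in range(n_gpus)]

    consumers = [ctx.Process(target=trainer, args=(g, queues[g]))
                 for g in range(n_gpus)]
    for p in consumers:
        p.start()

    producer = ctx.Process(target=batch_producer, args=(queues,))
    producer.start()

    producer.join()
    for p in consumers:
        p.join()
```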
Hi, I am using a training platform where users have little permission to control the nodes directly. That means I cannot specify a worker; workers are allocated by the system directly. Generally, all I can determine is the number of nodes and GPUs I want to use.