Training stops after using layerdrop

SefaZeng · August 23, 2021, 12:31pm

I added the LayerDrop code from fairseq to my OpenNMT code. It didn't look like I needed to make many changes. The core LayerDrop code is the following:

    def __iter__(self):
        dropout_probs = torch.empty(len(self)).uniform_()
        for i, m in enumerate(super().__iter__()):
            if not self.training or (dropout_probs[i] > self.p):
                yield m

I then replaced the PyTorch module list with the LayerDrop list:

        if encoder_layerdrop > 0.0:
            self.transformer = LayerDropModuleList(p=encoder_layerdrop)
        else:
            self.transformer = nn.ModuleList([])
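
For readers following along, here is a minimal self-contained sketch of how the two snippets above could fit together. The constructor (taking `p` and an optional `modules` argument) is an assumption, since the post only quotes `__iter__`; the layer sizes and the drop probability are arbitrary illustration values.

```python
import torch
import torch.nn as nn


class LayerDropModuleList(nn.ModuleList):
    """ModuleList whose iterator randomly skips layers during training."""

    def __init__(self, p, modules=None):
        # Assumed constructor: the original post does not show it.
        super().__init__(modules)
        self.p = p  # probability of dropping each layer

    def __iter__(self):
        # One uniform sample in [0, 1) per layer, redrawn on every iteration.
        dropout_probs = torch.empty(len(self)).uniform_()
        for i, m in enumerate(super().__iter__()):
            # In eval mode every layer is kept; in training a layer is kept
            # only when its sample exceeds p.
            if not self.training or (dropout_probs[i] > self.p):
                yield m


layers = LayerDropModuleList(p=0.5, modules=[nn.Linear(4, 4) for _ in range(6)])

layers.train()
kept_train = len(list(layers))  # random, between 0 and 6

layers.eval()
kept_eval = len(list(layers))   # always 6: nothing is dropped in eval mode

print(kept_train, kept_eval)
```

Because the mask is drawn from the local random number generator on each iteration, two processes running this code will generally keep different layers in the same step.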

Training then gets stuck at the same step every time, and it seems to stop here:

        if self.accum_count > 1:
            if self.n_gpu > 1:
                grads = [p.grad.data for p in self.model.parameters()
                         if p.requires_grad
                         and p.grad is not None]
                onmt.utils.distributed.all_reduce_and_rescale_tensors(
                    grads, float(1))
            self.optim.step()  # training stops here at the same step, and the logger in the optimizer does not report anything

Did I miss something? Is there any way to solve this?

Zenglinxiao · August 24, 2021, 1:02pm

Hi @SefaZeng,
Since LayerDrop changes the model structure, I assume that the model replicas on different devices do not share the same compute graph in a single forward/backward pass, which violates the assumption behind the current gradient all_reduce implementation. Note that grads is a list that gathers all the gradients that need to be synchronized across devices.
I guess you could change grads to a map-like structure and write a function similar to onmt.utils.distributed.all_reduce_and_rescale_tensors to fix this.
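
The mismatch described above can be reproduced on a single machine. The sketch below simulates two devices by running the same small stack of layers with two different keep masks, then collects gradients with the same filter as the trainer snippet. The helper names `collect_grads` and `run_replica` are hypothetical and exist only for this illustration.

```python
import torch
import torch.nn as nn


def collect_grads(model):
    # Same filter as the trainer snippet: skip parameters without a gradient.
    return [p.grad.data for p in model.parameters()
            if p.requires_grad and p.grad is not None]


def run_replica(keep_mask):
    """Simulate one device: run only the layers selected by keep_mask."""
    torch.manual_seed(0)  # identical initial weights on every simulated replica
    layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])
    x = torch.ones(2, 4)
    for layer, keep in zip(layers, keep_mask):
        if keep:
            x = layer(x)
    x.sum().backward()
    return collect_grads(layers)


grads_dev0 = run_replica([True, True, True])   # no layer dropped
grads_dev1 = run_replica([True, False, True])  # middle layer dropped

# Each nn.Linear contributes a weight and a bias gradient.
print(len(grads_dev0), len(grads_dev1))  # prints: 6 4
```

With lists of different lengths, the devices no longer agree on which tensors take part in the collective operation, which is consistent with the hang reported in the question.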

SefaZeng · August 30, 2021, 12:03pm

Hi @Zenglinxiao, thank you for your reply. I'm still confused about what to do. Are there any examples of this? And why would changing grads from a list to a map structure solve this?

Zenglinxiao · August 30, 2021, 1:12pm

Why would changing grads from a list to a map structure solve this?

Once you skip some layers during training, their p.grad will be None after the backward pass, so the grads collected will not be consistent between devices, since layer drop is applied independently on each device.
That is why I recommend using a map instead, assigning an id to each grad so that all_reduce is applied consistently. It is only a way to make sure the all_reduce is done between grads with the same name, which identifies the same parameters of the model.
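
One way this suggestion might look in code is sketched below. It is not part of OpenNMT: `all_reduce_named_grads` is a hypothetical helper, it reduces one tensor at a time without buffering, and it assumes that a missing gradient can be replaced by zeros so that every device reduces the same parameter names in the same order. When no process group is initialized, the all_reduce call is skipped so the example also runs in a single process.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def all_reduce_named_grads(model, rescale_denom=1.0):
    """Hypothetical helper: all-reduce gradients keyed by parameter name.

    Parameters whose layer was dropped on this device receive a zero
    gradient, so all devices reduce the same names in the same order.
    """
    named = {name: p for name, p in model.named_parameters()
             if p.requires_grad}
    for name in sorted(named):  # deterministic order shared by all devices
        p = named[name]
        if p.grad is None:
            # Layer was skipped locally: contribute zeros to the sum.
            p.grad = torch.zeros_like(p)
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(p.grad)  # default reduction is a sum
        p.grad.div_(rescale_denom)
    return named


# Single-process usage: the second layer is "dropped" on this device.
torch.manual_seed(0)
model = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 4)])
out = model[0](torch.ones(2, 4))
out.sum().backward()

named = all_reduce_named_grads(model)
print(sorted(named))
```

One trade-off of zero-filling is that the optimizer now sees a gradient for every parameter, including layers that were dropped on all devices in that step, so optimizers with momentum or weight decay will still update those parameters. Whether that is acceptable depends on the training setup.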

Are there any examples of this?

Unfortunately, I don't know whether such an example exists. You could search GitHub for something similar.