How to switch from data parallelism to model parallelism?

This program currently uses data parallelism, which works well for speeding up training.
However, I have to build two models in my program, and they do not fit together in the CUDA memory of a single GPU.
Therefore, I want to place them on different GPUs for training. The final loss has two parts: an individual loss from each of the two models, and a joint loss produced by both models together.

How can I implement this in my program?

What did you try so far? I think this should just be about carefully placing and moving tensors on the correct device.

You are right. I need to send the two models to different GPUs and move the data to the corresponding devices.
Finally, I also need to move the losses from the different GPUs onto the same GPU and add them up for back-propagation.
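
For anyone landing here later, the steps described above could look roughly like the sketch below. It assumes PyTorch, which the thread does not state explicitly. The two `nn.Linear` models, the MSE losses, the tensor shapes, and the names `model_a`, `model_b`, and `train_step` are all hypothetical placeholders, not the original program. It falls back to the CPU when fewer than two GPUs are visible so that it still runs.

```python
import torch
import torch.nn as nn

# Assumption: PyTorch. Use two GPUs if available, otherwise fall back to CPU.
if torch.cuda.device_count() >= 2:
    dev_a, dev_b = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev_a = dev_b = torch.device("cpu")

# Hypothetical stand-ins for the two models; each one lives on its own device.
model_a = nn.Linear(16, 4).to(dev_a)
model_b = nn.Linear(16, 4).to(dev_b)

# One optimizer can hold parameters that live on different devices.
optimizer = torch.optim.SGD(
    list(model_a.parameters()) + list(model_b.parameters()), lr=0.01
)
criterion = nn.MSELoss()


def train_step(x, y):
    optimizer.zero_grad()

    # Move a copy of the batch to each model's device.
    out_a = model_a(x.to(dev_a))
    out_b = model_b(x.to(dev_b))

    # Individual losses, each computed on its own device.
    loss_a = criterion(out_a, y.to(dev_a))
    loss_b = criterion(out_b, y.to(dev_b))

    # Joint loss (placeholder: agreement between the two outputs).
    # .to() is tracked by autograd, so gradients flow back to model_b.
    loss_joint = criterion(out_a, out_b.to(dev_a))

    # Gather everything on one device and sum before calling backward().
    total = loss_a + loss_b.to(dev_a) + loss_joint
    total.backward()
    optimizer.step()
    return total.item()


x = torch.randn(8, 16)
y = torch.randn(8, 4)
loss_value = train_step(x, y)
```

Because the device transfers are part of the autograd graph, a single `backward()` call on the summed loss is enough to send gradients to both models on their respective GPUs.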