Data parallelism has been used in this program, and it is a good option for speeding up training.
However, I have to build two models, and a single GPU's memory cannot hold both of them.
Therefore, I want to place them on different GPUs for training. The final loss consists of two parts: one part comes from each of the two models separately, and the other is produced jointly by both models.
How can I implement this in my program?
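To make the question concrete, here is a minimal sketch of the setup I have in mind. The model definitions, tensor sizes, and MSE losses are placeholders, not my real code, and the sketch falls back to CPU when two GPUs are not available:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical two-GPU placement; fall back to CPU so the sketch runs anywhere.
dev0 = torch.device("cuda:0" if torch.cuda.device_count() >= 1 else "cpu")
dev1 = torch.device("cuda:1" if torch.cuda.device_count() >= 2 else "cpu")

# Two placeholder models, one per device.
model_a = nn.Linear(16, 4).to(dev0)
model_b = nn.Linear(16, 4).to(dev1)

# One optimizer can own parameters that live on different devices.
opt = torch.optim.SGD(
    list(model_a.parameters()) + list(model_b.parameters()), lr=0.1
)

x = torch.randn(8, 16)
target = torch.randn(8, 4)

# Forward pass of each model on its own device.
out_a = model_a(x.to(dev0))
out_b = model_b(x.to(dev1))

# Per-model loss terms, each computed on that model's device.
loss_a = F.mse_loss(out_a, target.to(dev0))
loss_b = F.mse_loss(out_b, target.to(dev1))

# Joint term: copy one output across devices so both operands are on dev0.
# Autograd tracks the .to() copy and routes gradients back to model_b.
joint = F.mse_loss(out_a, out_b.to(dev0))

# Combine the three parts on a single device and backpropagate through both models.
total = loss_a + loss_b.to(dev0) + joint
opt.zero_grad()
total.backward()
opt.step()
```

Is this cross-device `.to()` pattern for the joint loss the right way to do it, or is there a better-supported approach?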