OpenNMT Forum

Running training on Windows with PyTorch 1.0.1

I was able to convert train.py to start properly under Windows 10 using 2 GPUs.
Problems started in the utility module with a call to torch.distributed.init_process_group(); on Windows, this function only seems to exist under torch.distributed.deprecated. Does this mean that these calls are obsolete and subject to change?
By the way, if anybody is interested in a train.py compatible with both Linux and Windows, I can post it.
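The namespace difference described above can be handled without hard-coding either path. The sketch below is an assumption about how one might dispatch, not OpenNMT-py's actual code; `pick_distributed` is a hypothetical helper, and the fake modules stand in for torch so the snippet runs anywhere.

```python
import types


def pick_distributed(torch_module):
    # Prefer the modern torch.distributed API; fall back to the
    # torch.distributed.deprecated namespace that some older Windows
    # builds exposed, as described in the post above.
    dist = getattr(torch_module, "distributed", None)
    if dist is None:
        return None
    if hasattr(dist, "init_process_group"):
        return dist
    return getattr(dist, "deprecated", None)


# Illustration with fake modules (no torch install required):
_dep = types.SimpleNamespace(init_process_group=lambda *a, **k: None)
old_torch = types.SimpleNamespace(
    distributed=types.SimpleNamespace(deprecated=_dep))
new_torch = types.SimpleNamespace(
    distributed=types.SimpleNamespace(
        init_process_group=lambda *a, **k: None))

assert pick_distributed(old_torch) is _dep
assert pick_distributed(new_torch) is new_torch.distributed
```

With a helper like this, the rest of the training script can call `dist.init_process_group(...)` without caring which namespace the local build provides.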

Hello @jmillo,

could you tell me how you converted train.py to start under Windows 10?
I need your help.
Thank you very much in advance

Hello Houda,
train.py under Windows 10 runs out of the box as long as you do not try to use more than one GPU, at least for me. The problems start with multiple GPUs, due to two factors: the use of Linux/Unix-exclusive signals, and the lack of implementation of some multi-threading functions in the Windows version of PyTorch. I made the PyTorch community aware of these problems about a year ago, but have not received an answer yet. The signal problem is quite easy to work around, but the other one requires an understanding of PyTorch internals that I have not yet acquired. In fact, I am considering moving to Linux, as the single-GPU limit is quite constraining.
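The "easy half" of the workaround mentioned above, guarding the POSIX-only signals, can be sketched as follows. The function name and structure here are illustrative assumptions, not the code actually used in OpenNMT-py; the point is only that SIGUSR1/SIGUSR2 do not exist on Windows, so referencing them unconditionally raises an AttributeError before training even starts.

```python
import signal
import sys


def install_usr_handlers(handler):
    """Install handlers for SIGUSR1/SIGUSR2 where they exist.

    SIGUSR1 and SIGUSR2 are POSIX-only; on Windows the attributes are
    missing from the signal module, so we skip them entirely there.
    Returns the list of signals actually installed.
    """
    if sys.platform == "win32":
        return []  # nothing to install; these signals do not exist here
    installed = []
    for sig in (signal.SIGUSR1, signal.SIGUSR2):
        signal.signal(sig, handler)
        installed.append(sig)
    return installed
```

A caller can then do `install_usr_handlers(signal.SIG_IGN)` (or a real handler) at startup and get the same script running on both platforms, minus the signal-based features on Windows.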
I hope this will help you. If not, please give more details.
Regards.

Hello again,
A correction to my previous entry: if you are running the latest release with PyTorch 1.3, line 183 of global_attention.py should be patched from “1 - mask” to “~mask”. If you run it as is, you will get a message suggesting that change.
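For anyone unsure what the patch means semantically: newer PyTorch releases make masks boolean tensors, on which arithmetic like “1 - mask” is rejected, while “~mask” performs an elementwise logical NOT. The tiny stand-in below illustrates the intended result on a plain Python list of booleans (it is not the torch code itself, just the semantics the patch preserves):

```python
def invert_bool_mask(mask):
    # Elementwise logical NOT, i.e. what "~mask" computes on a
    # torch.bool tensor, and what "1 - mask" used to compute back when
    # masks were 0/1 integer tensors.
    return [not bit for bit in mask]


assert invert_bool_mask([True, False, True]) == [False, True, False]
```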
Regards,
J.M.