ModuleNotFoundError Fused_Adam_Cuda in Google Colab

Hello,

I’m trying to fine-tune a pretrained transformer with OpenNMT-py in Google Colab. I installed OpenNMT-py in Colab, but when I run onmt_train with my YAML file I get this error:

Traceback (most recent call last):
  File "/usr/local/bin/onmt_train", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/train.py", line 172, in main
    train(opt)
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/train.py", line 157, in train
    train_process(opt, device_id=0)
  File "/usr/local/lib/python3.7/dist-packages/onmt/train_single.py", line 71, in main
    optim = Optimizer.from_opt(model, opt, checkpoint=checkpoint)
  File "/usr/local/lib/python3.7/dist-packages/onmt/utils/optimizers.py", line 274, in from_opt
    build_torch_optimizer(model, optim_opt),
  File "/usr/local/lib/python3.7/dist-packages/onmt/utils/optimizers.py", line 85, in build_torch_optimizer
    betas=betas)
  File "/usr/local/lib/python3.7/dist-packages/onmt/utils/optimizers.py", line 579, in __init__
    fused_adam_cuda = importlib.import_module("fused_adam_cuda")
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_adam_cuda'

I have a GPU runtime set up, but Python can’t seem to find the fused_adam_cuda module from the apex library. I’ve tried reinstalling apex manually a few different ways, to no avail. I’ve checked the installed apex files and the fused_adam_cuda file is there, so I’m not sure why the import fails. Has anyone else run into this issue, or does anyone have suggestions?

I’ve also tried installing OpenNMT-py by cloning the GitHub repo, but when I train that way I get a different error:

ImportError: cannot import name 'Field' from 'torchtext.data'

The solutions I’ve found online for that one didn’t work either. Please let me know if you need any more info. I really appreciate the help!

Hi,

Most likely something went wrong when installing Apex. Make sure to follow their installation instructions and check whether the build reports any errors.
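Before reinstalling, it can help to confirm what the Colab runtime can actually see. Here is a minimal sketch that only uses the standard library. It assumes the extension is exposed as a top-level module named fused_adam_cuda, as the traceback suggests; the helper name find_missing_modules is made up for illustration and is not part of Apex or OpenNMT-py.

```python
import importlib.util


def find_missing_modules(names):
    """Return the names from `names` that cannot be located on the import path.

    Hypothetical helper for illustration; it only asks the import system
    whether each top-level module can be found, without importing it.
    """
    missing = []
    for name in names:
        # find_spec returns None when no finder can locate a top-level module
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # "apex" is the Python package; "fused_adam_cuda" is the module name
    # taken from the traceback (assumed to be a compiled extension).
    missing = find_missing_modules(["apex", "fused_adam_cuda"])
    if missing:
        print("Not importable in this runtime:", ", ".join(missing))
    else:
        print("Both modules were found on the import path.")
```

If apex is found but fused_adam_cuda is not, that would be consistent with Apex having been installed without its compiled extensions, in which case the verbose install log is the place to look for the build error. A file being present on disk does not guarantee it is on the import path of the Python that Colab is running.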

This discussion might be helpful as well: "FP16 training error" (issue #1645 in OpenNMT/OpenNMT-py on GitHub).

The whole apex setup is far from ideal, but when trying to get rid of it we witnessed unwanted behavior with torch/amp. This should be investigated again at some point.