Install Torch 7 with CUDA 9.1

tel34 · February 1, 2018, 6:50pm

Thanks for that. I have temporarily gone back to Cuda 8 but am getting the same nvcc fatal error: Unsupported gpu architecture ‘compute_61’. As I had no issues installing Cuda 8 on a machine with a GTX 1070, I must put the failure down to an issue with the new GTX 1080Ti.
I’m assuming that after your installation you are successfully running OpenNMT?
I will run your procedure in the morning.
Thanks again.
Terence

jean.senellart · February 2, 2018, 5:07pm

Hi Terence, yes - OpenNMT runs smoothly with the procedure described above - just missing half precision, but we had no success with half precision so far anyway.

Best!

vince62s · February 2, 2018, 5:09pm

Did you try it on Ubuntu 14.04 too?

jprama · February 3, 2018, 1:00pm

Cuda 9 is not supported on Ubuntu 14. It is only available for Ubuntu 16.04 and Ubuntu 17.04 (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu).

tel34 · February 3, 2018, 5:18pm

Seem to have narrowed this down to an issue with THCBlas which is the point where the NVCC build fails (at 4%). Will report my progress.
Terence

vince62s · February 4, 2018, 9:58am

I just installed cuda 9.0 + patch 1 on Ubuntu 14.04 and it runs fine with TF 1.5.0 (whose precompiled binary is built for cuda 9.0).
I’ll try to install torch later.
Just to report that the issue is not really cuda 9 being incompatible with Ubuntu 14; it could be a library issue later on when building.

tel34 · February 5, 2018, 12:24pm

Everything goes fine with the installation according to the procedure outlined by @jprama until we get to the CUDA section. I am puzzled that Cuda 7.5 is reported, because I have installed Cuda-9.1 and it has passed the tests in the Samples. The first “failure” is at lib/THC/CMakeFiles/THC.dir/build.make:70: recipe for target ‘lib/THC/CMakeFiles/THC.dir/THC_generated_THCBlas.cu.o’ failed
and I honestly do not know what to look for here. Any suggestions would be welcome. The relevant part of my install log is given below:

Found CUDA on your machine. Installing CUDA packages
Building on 4 cores
-- Found Torch7 in /home/miguel/torch/install
-- Removing -DNDEBUG from compile flags
-- TH_LIBRARIES: TH
-- Found gcc >=5 and CUDA <= 7.5, adding workaround C++ flags
-- MAGMA not found. Compiling without MAGMA support
-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.0;3.5;5.0;5.2;5.2+PTX
-- got cuda version 7.5
-- Found CUDA with FP16 support, compiling with torch.CudaHalfTensor
-- CUDA_NVCC_FLAGS: -gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_52,code=sm_52;-gencode;arch=compute_52,code=compute_52;-DCUDA_HAS_FP16=1
-- THC_SO_VERSION: 0
-- Configuring done
-- Generating done
-- Build files have been written to: /home/miguel/torch/extra/cutorch/build
[  2%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCBlas.cu.o
[  2%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCReduceApplyUtils.cu.o
[  3%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCHalf.cu.o
[  4%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCSleep.cu.o
lib/THC/CMakeFiles/THC.dir/build.make:70: recipe for target 'lib/THC/CMakeFiles/THC.dir/THC_generated_THCBlas.cu.o' failed
lib/THC/CMakeFiles/THC.dir/build.make:77: recipe for target 'lib/THC/CMakeFiles/THC.dir/THC_generated_THCSleep.cu.o' failed
lib/THC/CMakeFiles/THC.dir/build.make:63: recipe for target 'lib/THC/CMakeFiles/THC.dir/THC_generated_THCReduceApplyUtils.cu.o' failed
lib/THC/CMakeFiles/THC.dir/build.make:560: recipe for target 'lib/THC/CMakeFiles/THC.dir/THC_generated_THCHalf.cu.o' failed
CMakeFiles/Makefile2:172: recipe for target 'lib/THC/CMakeFiles/THC.dir/all' failed
Makefile:127: recipe for target 'all' failed

jopts=$(getconf _NPROCESSORS_CONF)

echo "Building on $jopts cores"
cmake -E make_directory build && cd build && cmake .. -DLUALIB= -DLUA_INCDIR=/home/miguel/torch/install/include -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/home/miguel/torch/install/bin/.." -DCMAKE_INSTALL_PREFIX="/home/miguel/torch/install/lib/luarocks/rocks/cutorch/scm-1" && make -j$jopts install

jprama · February 5, 2018, 1:03pm

You have a conflict with a cuda 7.5 installation on your system.
Can you check whether /usr/local/cuda is a sym link (ls -l /usr/local/cuda) and whether any cuda 7.5 packages are installed on your system (dpkg -l | grep nvidia)?
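
A minimal sketch of the same check in script form, for anyone landing here with the same symptom. The helper name nvcc_release is made up for illustration, and the sed pattern assumes the usual "Cuda compilation tools, release X.Y, ..." line in the output of nvcc --version. One likely cause of this kind of mismatch is that the nvcc found first on PATH is not the one under /usr/local/cuda.

```shell
# Extract the release number from `nvcc --version` output read on stdin,
# e.g. from a line such as "Cuda compilation tools, release 9.1, V9.1.85".
# (nvcc_release is a hypothetical helper name, not part of any toolkit.)
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Show where the default CUDA symlink points, if it is a symlink at all.
if [ -L /usr/local/cuda ]; then
    echo "/usr/local/cuda -> $(readlink -f /usr/local/cuda)"
fi

# Show which nvcc comes first on PATH and which release it reports.
if command -v nvcc >/dev/null 2>&1; then
    echo "nvcc on PATH: $(command -v nvcc)"
    echo "release: $(nvcc --version | nvcc_release)"
else
    echo "nvcc not found on PATH"
fi
```

If the reported release differs from the version the symlink points to, the build is probably picking up a second toolkit.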

tel34 · February 5, 2018, 2:16pm

The sym link /usr/local/cuda points to /usr/local/cuda-9.1, which is where I successfully ran the Cuda tests. However, I DO seem to still have cuda 7.5 packages on the system, as shown below. They must have been installed when I installed the driver after fitting the GTX 1080Ti but before running your installation procedure. Do you recommend getting rid of them selectively or doing an apt-get remove --purge nvidia*?
Thanks,
Output of dpkg -l | grep nvidia:

ii  nvidia-387                            387.34-0ubuntu0~gpu16.04.2                 amd64        NVIDIA binary driver - version 387.34
ii  nvidia-cuda-dev                       7.5.18-0ubuntu1                            amd64        NVIDIA CUDA development files
ii  nvidia-cuda-doc                       7.5.18-0ubuntu1                            all          NVIDIA CUDA and OpenCL documentation
ii  nvidia-cuda-gdb                       7.5.18-0ubuntu1                            amd64        NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit                   7.5.18-0ubuntu1                            amd64        NVIDIA CUDA development toolkit
ii  nvidia-opencl-dev:amd64               7.5.18-0ubuntu1                            amd64        NVIDIA OpenCL development files
ii  nvidia-opencl-icd-387                 387.34-0ubuntu0~gpu16.04.2                 amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                          0.8.2                                      amd64        Tools to enable NVIDIA's Prime
ii  nvidia-profiler                       7.5.18-0ubuntu1                            amd64        NVIDIA Profiler for CUDA and OpenCL
ii  nvidia-settings                       390.12-0ubuntu0~gpu16.04.1                 amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-visual-profiler                7.5.18-0ubuntu1                            amd64        NVIDIA Visual Profiler for CUDA and OpenCL

jprama · February 6, 2018, 9:40am

hello,

You just need to remove the following packages
nvidia-cuda-dev
nvidia-cuda-doc
nvidia-cuda-gdb
nvidia-cuda-toolkit
nvidia-opencl-dev:amd64
nvidia-profiler
nvidia-visual-profiler
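
The seven packages above are exactly the ones at version 7.5.18 in the dpkg listing. As a rough sketch, the list can be derived rather than typed by hand; the function name cuda75_packages is made up for illustration, and the filter assumes the usual dpkg -l column order (status, name, version, architecture, description).

```shell
# Print the names of installed ("ii") packages whose version starts with 7.5.
# Reads `dpkg -l`-style lines on stdin.
# (cuda75_packages is a hypothetical helper name.)
cuda75_packages() {
    awk '$1 == "ii" && $3 ~ /^7\.5\./ { print $2 }'
}

# Usage on a real system (review the list before removing anything):
#   dpkg -l | grep nvidia | cuda75_packages
#   sudo apt-get remove $(dpkg -l | grep nvidia | cuda75_packages)
```

The driver packages (nvidia-387, nvidia-settings and so on) carry 387.x or 390.x version strings, so the filter leaves them alone.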

tel34 · February 6, 2018, 1:50pm

Thanks - will report progress in a couple of days - have to turn my mind to something different today.

jprama · February 6, 2018, 2:15pm

For information, I was able to install torch 7 and cuda 9 (using the Ubuntu 16 packages: https://developer.nvidia.com/compute/cuda/9.1/Prod/local_installers/cuda_9.1.85_387.26_linux and https://developer.nvidia.com/compute/cuda/9.1/Prod/patches/1/cuda_9.1.85.1_linux) on Ubuntu 14.04 LTS.
I was able to launch a training and a translation with OpenNMT.

tel34 · February 7, 2018, 11:43am

Hi, I am following your procedure to the letter and still failing to complete the build at the “Cuda” stage. The relevant part of the log is pasted below. Any suggestions would be welcome. Thanks.

cd build && make install
Updating manifest for /home/miguel/torch/install/lib/luarocks/rocks
optim 1.0.5-0 is now built and installed in /home/miguel/torch/install/ (license: BSD)

Found CUDA on your machine. Installing CUDA packages
Building on 4 cores
-- Found Torch7 in /home/miguel/torch/install
-- Removing -DNDEBUG from compile flags
-- TH_LIBRARIES: TH
-- MAGMA not found. Compiling without MAGMA support
-- Autodetected CUDA architecture(s): 6.1
-- got cuda version 9.1
-- Found CUDA with FP16 support, compiling with torch.CudaHalfTensor
-- CUDA_NVCC_FLAGS: -gencode;arch=compute_61,code=sm_61;-DCUDA_HAS_FP16=1
-- THC_SO_VERSION: 0
-- Configuring done
-- Generating done
-- Build files have been written to: /home/miguel/torch/extra/cutorch/build
[  1%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorMath.cu.o
[  3%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorMathMagma.cu.o
[  3%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorMathBlas.cu.o
[  4%] Building NVCC (Device) object lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorMathPairwise.cu.o
lib/THC/CMakeFiles/THC.dir/build.make:3793: recipe for target 'lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorMathPairwise.cu.o' failed
lib/THC/CMakeFiles/THC.dir/build.make:2901: recipe for target 'lib/THC/CMakeFiles/THC.dir/THC_generated_THCTensorMath.cu.o' failed
CMakeFiles/Makefile2:172: recipe for target 'lib/THC/CMakeFiles/THC.dir/all' failed
Makefile:127: recipe for target 'all' failed

jopts=$(getconf _NPROCESSORS_CONF)

echo "Building on $jopts cores"
cmake -E make_directory build && cd build && cmake .. -DLUALIB= -DLUA_INCDIR=/home/miguel/torch/install/include -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS} -DCMAKE_BUILD_TYPE=Release -DCMAKE_PREFIX_PATH="/home/miguel/torch/install/bin/.." -DCMAKE_INSTALL_PREFIX="/home/miguel/torch/install/lib/luarocks/rocks/cutorch/scm-1" && make -j$jopts install
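
As a side note on the log above: the CUDA_NVCC_FLAGS line follows directly from the autodetected architecture (6.1 becomes compute_61 / sm_61). A small sketch of that mapping, with gencode_flag as a made-up helper name; it prints the flag in the space-separated form used on an nvcc command line, whereas the CMake log shows the same pieces separated by semicolons.

```shell
# Turn a compute capability such as "6.1" into the matching -gencode flag.
# (gencode_flag is a hypothetical helper name.)
gencode_flag() {
    cc=$(printf '%s' "$1" | tr -d '.')
    printf '%s\n' "-gencode arch=compute_${cc},code=sm_${cc}"
}

gencode_flag 6.1
```

This is only meant to make the log easier to read; it does not change what the build detects.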

tel34 · February 8, 2018, 4:22pm

I have followed the prescribed procedure precisely and entered
export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" immediately before running install.sh. Each time the installation fails at the point shown below. As I have had cuda 8 running with a GTX 1070 on another machine for a year without problems, I am inclined to remove cuda 9 and go back to cuda 8 unless any kind suggestions help.

panosk · February 8, 2018, 4:56pm

Hi @tel34,

The error shows that half operators are still in play. Try to clean first and then rerun it:

./clean.sh
export TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
./install.sh

tel34 · February 8, 2018, 7:05pm

Hi @panosk,
Yes, I’ve done that. The NVCC build now got to 11% and then failed as shown below. I’ve researched this but have not put my finger on what’s going wrong. It seems not to be retaining the export command. I’ve even tried putting it in ~/.bashrc.

panosk · February 8, 2018, 7:12pm

Hm, weird… Are you using bash? Try to run this and see:

echo $0

tel34 · February 8, 2018, 7:17pm

Yes, and I get -bash. I’ve tried it again since I sent the last message and it gets to 11% in the NVCC build and fails.

panosk · February 8, 2018, 7:25pm

Well, the last things I can think of are to source ~/.bashrc if you haven’t logged out/rebooted since you changed it, or to put the export in front of install.sh on one line:

TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" ./install.sh
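
For readers wondering what the one-line form does: it places the variable in the environment of that single command. The sketch below uses a tiny child shell as a stand-in for install.sh (the variable name child is just for illustration) and contrasts it with a plain assignment that is never exported. It does not establish why the export failed to stick in this particular case.

```shell
# Stand-in for install.sh: a child process that reports what it sees.
child='echo "child sees: [$TORCH_NVCC_FLAGS]"'

# Case 1: plain assignment without export; the child does not inherit it.
unset TORCH_NVCC_FLAGS
TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__"
sh -c "$child"

# Case 2: one-line prefix form; the variable is set for that command only.
unset TORCH_NVCC_FLAGS
TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" sh -c "$child"
```

The first call prints "child sees: []" and the second prints "child sees: [-D__CUDA_NO_HALF_OPERATORS__]".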

tel34 · February 8, 2018, 7:49pm

Well, @panosk, I owe you a beer :-).
Putting the export in front of install.sh did the trick.
Hopefully tomorrow I can start some new trainings with my GTX 1080Ti.
And thanks to everyone else who offered suggestions!