I have installed the PyTorch version of OpenNMT and am currently trying to run training with the demo data set. However, after several hours of training, the accuracy still seems quite low and the training does not terminate.
Here are some details from the console output:
The code is running on an Amazon AWS virtual machine based on one of the preconfigured Deep Learning AMIs. The specs are: instance type t2.xlarge, 4 vCPUs, 16 GB RAM.
So, I have two questions:
What is the termination criterion for the default training?
How long would it normally take to get there?
Thanks a lot.
100000 steps?
258/249 tok/s is very slow; try using a GPU.
In my case, FloydHub with 1 GPU gives me 5082/4288 tok/s; it takes approximately 10 h to train and costs $12.
OK, so after running the training for 4 hours on a specialised GPU AWS instance (instance type: g2.2xlarge, 15 GiB RAM, 26 ECUs, 8 vCPUs), the training is still only at step 4300 of 100000. At roughly 1075 steps per hour, the basic training will take about 93 hours if it continues like this.
Does anyone have a recommended EC2 instance for running the training? Or does it need some kind of special configuration to make use of the GPU(s)?
The main train command is quite simple. Minimally it takes a data file and a save file. This will run the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and the decoder. You can also add -gpuid 1 to use (say) GPU 1.
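For reference, the demo training command looks roughly like this (a sketch only: the data paths are whatever your preprocessing step produced, and flag names have changed between OpenNMT-py versions, so check `python train.py -h` for your install):

```
# Assumes the demo data was already preprocessed into data/demo.* files.
# -save_model writes checkpoints with the given prefix; -gpuid 1 selects GPU 1.
python train.py -data data/demo -save_model demo-model -gpuid 1

# If your version exposes a step-limit option, the 100000-step default
# mentioned above can be lowered for a quick test run, e.g.:
# python train.py -data data/demo -save_model demo-model -gpuid 1 -train_steps 20000
```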
Ah, bummer. I missed the bit about the GPU parameter. Sorry about this.
I have now added the GPU to the command and it works fine. On the EC2 P2 instance, my throughput is now 10 times what it was before.
The G2 instance, however, doesn’t work and gives the following error:
[2018-09-14 09:34:03,787 INFO] Building model...
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/cuda/__init__.py:116: UserWarning:
Found GPU0 GRID K520 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
Consequently, the training fails to start on this image with the following message:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THC/THCTensorCopy.cu line=206 error=48 : no kernel image is available for execution on the device
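If you want to check in advance whether a given instance's GPU is new enough for the prebuilt PyTorch binaries, one quick way (an assumed helper, not part of OpenNMT itself) is to print its CUDA compute capability:

```
# Prints the (major, minor) compute capability of GPU 0.
# The GRID K520 in g2 instances reports (3, 0), which the prebuilt PyTorch
# binaries no longer support, as the warning above says; the K80 in p2
# instances reports (3, 7) and works.
python -c "import torch; print(torch.cuda.get_device_capability(0))"
```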
But that’s just a matter of selecting the P2 instance over the G2 instance.
Thanks so much, Guillaume.