I have installed the PyTorch version of OpenNMT and am currently trying to run training with the demo data set. However, after several hours of training, the accuracy still seems quite low and the training does not terminate.
Here are some details from the console output:
The code is running on an Amazon AWS virtual machine based on one of the preconfigured Deep Learning AMIs. The specs are: instance type t2.xlarge, 4 vCPUs, 16 GB RAM.
So, I have two questions:
What is the termination criterion for the default training?
How long would it normally take to get there?
Thanks a lot.
100000 steps?
258/249 tok/s is very slow; try using a GPU.
In my case, FloydHub with 1 GPU gives me 5082/4288 tok/s; it takes approximately 10 h to train and costs $12.
OK, so after running the training for 4 hours on a specialised GPU AWS instance (instance type: g2.2xlarge, 15 GiB RAM, 26 ECUs, 8 vCPUs), the training is still only at step 4300 of 100000. At roughly 1075 steps per hour, the basic training will take about 93 hours if it continues like this.
Does anyone have a recommended EC2 instance for running the training? Or does it need some kind of special configuration to make use of the GPU(s)?
The main train command is quite simple. Minimally it takes a data file and a save file. This will run the default model, which consists of a 2-layer LSTM with 500 hidden units on both the encoder and the decoder. You can also add -gpuid 1 to use (say) GPU 1.
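For reference, the demo training command looks roughly like this (a sketch only: the data paths are whatever your preprocessing step produced, and flag names have changed between OpenNMT-py versions, so check `python train.py -h` for your install):

```
# Assumes the demo data was already preprocessed into data/demo.* files.
# -save_model writes checkpoints with the given prefix; -gpuid 1 selects GPU 1.
python train.py -data data/demo -save_model demo-model -gpuid 1

# If your version exposes a step-limit option, the 100000-step default
# mentioned above can be lowered for a quick test run, e.g.:
# python train.py -data data/demo -save_model demo-model -gpuid 1 -train_steps 20000
```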
Ah, bummer. I missed the bit about the GPU parameter. Sorry about this.
I have now added the GPU to the command and it works fine. On the EC2 P2 instance, my throughput is now 10 times what it was before.
The G2 instance, however, doesn’t work and gives the following error:
[2018-09-14 09:34:03,787 INFO] Building model...
/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/cuda/__init__.py:116: UserWarning:
Found GPU0 GRID K520 which is of cuda capability 3.0.
PyTorch no longer supports this GPU because it is too old.
Consequently, the training fails to start on this image with the following message:
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THC/THCTensorCopy.cu line=206 error=48 : no kernel image is available for execution on the device
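If you want to check in advance whether a given instance's GPU is new enough for the prebuilt PyTorch binaries, one quick way (an assumed helper, not part of OpenNMT itself) is to print its CUDA compute capability:

```
# Prints the (major, minor) compute capability of GPU 0.
# The GRID K520 in g2 instances reports (3, 0), which the prebuilt PyTorch
# binaries no longer support, as the warning above says; the K80 in p2
# instances reports (3, 7) and works.
python -c "import torch; print(torch.cuda.get_device_capability(0))"
```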
But that’s just a matter of selecting the P2 instance over the G2 instance.
Thanks so much, Guillaume.