I ran my model on an HPC cluster with 2 nodes.
Here is the .sh file:
#!/bin/sh
#SBATCH --job-name=My_Model
#SBATCH --output=model_result.txt
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=90:00:00
#SBATCH --account=g.alex052
python train.py -batch_size 128 \
    -accum_count 1 \
    -report_every 100 \
    -layers 1 \
    -world_size 2 \
    -gpu_ranks 0 1 \
    -rnn_size 256 \
    -data data/data \
    -pre_word_vecs_enc "data/embeddings.enc.pt" \
    -pre_word_vecs_dec "data/embeddings.dec.pt" \
    -src_word_vec_size 448 \
    -tgt_word_vec_size 672 \
    -fix_word_vecs_enc \
    -fix_word_vecs_dec \
    -save_model data/model_2_layer \
    -save_checkpoint_steps 2000 \
    -train_steps 199300 \
    -model_type text \
    -encoder_type rnn \
    -decoder_type rnn \
    -rnn_type LSTM \
    -global_attention dot \
    -global_attention_function softmax \
    -early_stopping 10 \
    -optim sgd \
    -learning_rate 0.5 \
    -valid_steps 2000 \
    -dropout .2 \
    -attention_dropout .3
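As a sanity check that the flags above are the ones the run actually used, the parameter totals printed in the log further down can be reproduced by hand. The sketch below is only an illustration: the vocabulary sizes are copied from the log, the layer shapes from the printed model summary, and the helper name `lstm_params` is made up for this example. It assumes the standard PyTorch LSTM layout (four gates, two bias vectors per layer).

```python
# Sizes copied from the command-line flags and the "vocab size" lines in the log.
SRC_VOCAB, TGT_VOCAB = 278744, 65461
SRC_EMB, TGT_EMB = 448, 672   # -src_word_vec_size / -tgt_word_vec_size
RNN_SIZE = 256                # -rnn_size


def lstm_params(input_size: int, hidden_size: int) -> int:
    """Parameter count of one LSTM layer (or LSTMCell): 4 gates, 2 bias vectors."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


# Encoder: source embedding table plus a single LSTM(448, 256) layer.
encoder = SRC_VOCAB * SRC_EMB + lstm_params(SRC_EMB, RNN_SIZE)

# Decoder total as the log reports it (the generator is counted with the decoder).
decoder = (
    TGT_VOCAB * TGT_EMB                          # target embedding table
    + lstm_params(TGT_EMB + RNN_SIZE, RNN_SIZE)  # input feeding: 672 + 256 = 928
    + 2 * RNN_SIZE * RNN_SIZE                    # attention linear_out, 512 -> 256, no bias
    + RNN_SIZE * TGT_VOCAB + TGT_VOCAB           # generator weight and bias
)

print(encoder, decoder, encoder + decoder)
```

These come out to 125600256, 62158805 and 187759061, the same figures as the "encoder", "decoder" and "number of parameters" lines in the log, so the model that was built corresponds to the flags in the script. Note that the frozen embeddings are still included in these totals.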
"model_result.txt" is the output of my model, and its content is:
[2020-04-11 19:05:57,031 INFO] * src vocab size = 278744
[2020-04-11 19:05:57,031 INFO] * tgt vocab size = 65461
[2020-04-11 19:05:57,032 INFO] Building model...
[2020-04-11 19:05:57,430 INFO] Loading dataset from data/data.train.0.pt
/share/apps/conda_envs/ba-hpc/lib/python3.6/site-packages/torch/nn/modules/rnn.py:51: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
"num_layers={}".format(dropout, num_layers))
[2020-04-11 19:06:03,454 INFO] NMTModel(
(encoder): RNNEncoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(278744, 448, padding_idx=1)
)
)
)
(rnn): LSTM(448, 256, dropout=0.2)
)
(decoder): InputFeedRNNDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(65461, 672, padding_idx=1)
)
)
)
(dropout): Dropout(p=0.2, inplace=False)
(rnn): StackedLSTM(
(dropout): Dropout(p=0.2, inplace=False)
(layers): ModuleList(
(0): LSTMCell(928, 256)
)
)
(attn): GlobalAttention(
(linear_out): Linear(in_features=512, out_features=256, bias=False)
)
)
(generator): Sequential(
(0): Linear(in_features=256, out_features=65461, bias=True)
(1): Cast()
(2): LogSoftmax()
)
)
[2020-04-11 19:06:03,455 INFO] encoder: 125600256
[2020-04-11 19:06:03,455 INFO] decoder: 62158805
[2020-04-11 19:06:03,455 INFO] * number of parameters: 187759061
[2020-04-11 19:06:03,457 INFO] Starting training on GPU: [0, 1]
[2020-04-11 19:06:03,457 INFO] Start training loop and validate every 2000 steps...
[2020-04-11 19:06:06,009 INFO] number of examples: 255105
[2020-04-11 19:06:48,127 INFO] Step 100/199300; acc: 11.95; ppl: 3099.19; xent: 8.04; lr: 0.50000; 34619/7053 tok/s; 45 sec
[2020-04-11 19:07:27,743 INFO] Step 200/199300; acc: 15.87; ppl: 777.53; xent: 6.66; lr: 0.50000; 48483/8016 tok/s; 84 sec
[2020-04-11 19:08:09,847 INFO] Step 300/199300; acc: 17.18; ppl: 499.48; xent: 6.21; lr: 0.50000; 43892/7801 tok/s; 126 sec
[2020-04-11 19:08:51,191 INFO] Step 400/199300; acc: 18.01; ppl: 405.38; xent: 6.00; lr: 0.50000; 41367/7978 tok/s; 168 sec
[2020-04-11 19:09:31,660 INFO] Step 500/199300; acc: 18.65; ppl: 360.15; xent: 5.89; lr: 0.50000; 43266/8430 tok/s; 208 sec
[2020-04-11 19:10:10,385 INFO] Step 600/199300; acc: 19.94; ppl: 295.57; xent: 5.69; lr: 0.50000; 45529/8491 tok/s; 247 sec
[2020-04-11 19:10:48,476 INFO] Step 700/199300; acc: 21.21; ppl: 252.27; xent: 5.53; lr: 0.50000; 45170/8182 tok/s; 285 sec
[2020-04-11 19:11:27,806 INFO] Step 800/199300; acc: 20.93; ppl: 244.93; xent: 5.50; lr: 0.50000; 45322/8119 tok/s; 324 sec
[2020-04-11 19:12:04,717 INFO] Step 900/199300; acc: 22.16; ppl: 219.50; xent: 5.39; lr: 0.50000; 46047/8663 tok/s; 361 sec
[2020-04-11 19:12:25,957 INFO] Loading dataset from data/data.train.0.pt
[2020-04-11 19:12:39,129 INFO] number of examples: 255105
[2020-04-11 19:12:43,528 INFO] Step 1000/199300; acc: 22.04; ppl: 213.52; xent: 5.36; lr: 0.50000; 45280/8442 tok/s; 400 sec
[2020-04-11 19:13:20,015 INFO] Step 1100/199300; acc: 23.19; ppl: 183.76; xent: 5.21; lr: 0.50000; 42228/8582 tok/s; 437 sec
[2020-04-11 19:14:00,869 INFO] Step 1200/199300; acc: 22.92; ppl: 187.18; xent: 5.23; lr: 0.50000; 47687/7848 tok/s; 477 sec
[2020-04-11 19:14:42,735 INFO] Step 1300/199300; acc: 23.39; ppl: 174.96; xent: 5.16; lr: 0.50000; 43457/7839 tok/s; 519 sec
[2020-04-11 19:15:25,017 INFO] Step 1400/199300; acc: 23.04; ppl: 177.58; xent: 5.18; lr: 0.50000; 42196/7800 tok/s; 562 sec
[2020-04-11 19:16:04,075 INFO] Step 1500/199300; acc: 23.09; ppl: 170.10; xent: 5.14; lr: 0.50000; 42833/8761 tok/s; 601 sec
[2020-04-11 19:16:43,874 INFO] Step 1600/199300; acc: 23.89; ppl: 156.16; xent: 5.05; lr: 0.50000; 45078/8194 tok/s; 640 sec
[2020-04-11 19:17:22,106 INFO] Step 1700/199300; acc: 25.32; ppl: 142.29; xent: 4.96; lr: 0.50000; 45292/8208 tok/s; 679 sec
[2020-04-11 19:18:00,531 INFO] Step 1800/199300; acc: 24.89; ppl: 141.99; xent: 4.96; lr: 0.50000; 45752/8269 tok/s; 717 sec
[2020-04-11 19:18:37,225 INFO] Step 1900/199300; acc: 25.44; ppl: 135.13; xent: 4.91; lr: 0.50000; 45776/8679 tok/s; 754 sec
[2020-04-11 19:18:57,488 INFO] Loading dataset from data/data.train.0.pt
[2020-04-11 19:19:06,595 INFO] number of examples: 255105
[2020-04-11 19:19:16,866 INFO] Step 2000/199300; acc: 24.79; ppl: 141.56; xent: 4.95; lr: 0.50000; 45610/8336 tok/s; 793 sec
[2020-04-11 19:19:16,866 INFO] Loading dataset from data/data.valid.0.pt
[2020-04-11 19:19:18,024 INFO] number of examples: 36395
[2020-04-11 19:22:55,550 INFO] Validation perplexity: 195.471
[2020-04-11 19:22:55,550 INFO] Validation accuracy: 21.0748
[2020-04-11 19:22:55,551 INFO] Model is improving ppl: inf --> 195.471.
[2020-04-11 19:22:55,551 INFO] Model is improving acc: -inf --> 21.0748.
[2020-04-11 19:22:57,880 INFO] Saving checkpoint data/model_2_layer_step_2000.pt
[2020-04-11 19:23:36,694 INFO] Step 2100/199300; acc: 26.70; ppl: 118.17; xent: 4.77; lr: 0.50000; 5786/1200 tok/s; 1053 sec
[2020-04-11 19:24:18,281 INFO] Step 2200/199300; acc: 25.77; ppl: 127.18; xent: 4.85; lr: 0.50000; 47819/7707 tok/s; 1095 sec
[2020-04-11 19:24:58,707 INFO] Step 2300/199300; acc: 26.27; ppl: 121.36; xent: 4.80; lr: 0.50000; 43737/8067 tok/s; 1135 sec
[2020-04-11 19:25:41,255 INFO] Step 2400/199300; acc: 25.51; ppl: 126.97; xent: 4.84; lr: 0.50000; 41536/7793 tok/s; 1178 sec
[2020-04-11 19:26:20,310 INFO] Step 2500/199300; acc: 25.53; ppl: 123.98; xent: 4.82; lr: 0.50000; 43334/8684 tok/s; 1217 sec
[2020-04-11 19:27:00,399 INFO] Step 2600/199300; acc: 26.36; ppl: 115.93; xent: 4.75; lr: 0.50000; 44896/8227 tok/s; 1257 sec
[2020-04-11 19:27:38,719 INFO] Step 2700/199300; acc: 27.66; ppl: 105.92; xent: 4.66; lr: 0.50000; 45641/8157 tok/s; 1295 sec
[2020-04-11 19:28:17,196 INFO] Step 2800/199300; acc: 27.15; ppl: 108.61; xent: 4.69; lr: 0.50000; 44987/8259 tok/s; 1334 sec
[2020-04-11 19:28:53,504 INFO] Step 2900/199300; acc: 27.31; ppl: 104.47; xent: 4.65; lr: 0.50000; 46097/8717 tok/s; 1370 sec
[2020-04-11 19:29:13,039 INFO] Loading dataset from data/data.train.0.pt
[2020-04-11 19:29:23,985 INFO] number of examples: 255105
[2020-04-11 19:29:33,581 INFO] Step 3000/199300; acc: 26.68; ppl: 110.72; xent: 4.71; lr: 0.50000; 45872/8281 tok/s; 1410 sec
[2020-04-11 19:30:08,897 INFO] Step 3100/199300; acc: 28.85; ppl: 91.91; xent: 4.52; lr: 0.50000; 41543/8837 tok/s; 1445 sec
[2020-04-11 19:30:50,875 INFO] Step 3200/199300; acc: 27.38; ppl: 103.95; xent: 4.64; lr: 0.50000; 48691/7683 tok/s; 1487 sec
[2020-04-11 19:31:31,723 INFO] Step 3300/199300; acc: 27.90; ppl: 98.49; xent: 4.59; lr: 0.50000; 43026/7895 tok/s; 1528 sec
[2020-04-11 19:32:14,113 INFO] Step 3400/199300; acc: 26.93; ppl: 104.08; xent: 4.65; lr: 0.50000; 41998/7905 tok/s; 1571 sec
[2020-04-11 19:32:53,570 INFO] Step 3500/199300; acc: 27.03; ppl: 102.11; xent: 4.63; lr: 0.50000; 42612/8574 tok/s; 1610 sec
[2020-04-11 19:33:33,079 INFO] Step 3600/199300; acc: 28.03; ppl: 95.52; xent: 4.56; lr: 0.50000; 45031/8271 tok/s; 1650 sec
[2020-04-11 19:34:11,302 INFO] Step 3700/199300; acc: 29.03; ppl: 88.82; xent: 4.49; lr: 0.50000; 45599/8211 tok/s; 1688 sec
[2020-04-11 19:34:49,473 INFO] Step 3800/199300; acc: 28.39; ppl: 91.30; xent: 4.51; lr: 0.50000; 45479/8309 tok/s; 1726 sec
[2020-04-11 19:35:25,918 INFO] Step 3900/199300; acc: 28.41; ppl: 90.00; xent: 4.50; lr: 0.50000; 45620/8812 tok/s; 1762 sec
[2020-04-11 19:35:44,463 INFO] Loading dataset from data/data.train.0.pt
[2020-04-11 19:35:58,129 INFO] number of examples: 255105
As we can see, the output stops at the second epoch (the last lines are the dataset reload after step 3900). I tried changing many parameters to fix this, but nothing worked. What is the problem, and how can I solve it?
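
To pin down where exactly the log stops, here is a rough estimate of how many steps one pass over the training data takes. This is a sketch under an assumption that is not confirmed anywhere in the post: that each training step consumes one batch of `-batch_size` examples per GPU. The variable names are illustrative only.

```python
NUM_EXAMPLES = 255105   # "number of examples" reported for data.train.0.pt
BATCH_SIZE = 128        # -batch_size
NUM_GPUS = 2            # -world_size
ACCUM_COUNT = 1         # -accum_count

# Assumption: one step = one batch per GPU, times the accumulation count.
examples_per_step = BATCH_SIZE * NUM_GPUS * ACCUM_COUNT
steps_per_pass = NUM_EXAMPLES / examples_per_step

last_logged_step = 3900
passes_completed = last_logged_step / steps_per_pass

print(f"{steps_per_pass:.1f} steps per pass over the training data")
print(f"{passes_completed:.2f} passes completed by step {last_logged_step}")
```

The estimate is a little under 1000 steps per pass, which is consistent with the "Loading dataset from data/data.train.0.pt" lines appearing just before steps 1000, 2000, 3000 and 4000. If the assumption holds, the log ends close to the end of the fourth pass rather than in the second, and shortly before step 4000, where the next validation and checkpoint are scheduled (-valid_steps 2000, -save_checkpoint_steps 2000). The first validation took about three and a half minutes in this log (19:19:18 to 19:22:55).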