I have run my model. The training report is printed every 100 steps, but only up to step 1200; after that, the printing freezes and no further steps appear. I ran the model again and hit the same problem.
My report:
[2020-03-28 14:54:19,715 INFO] Loading dataset from data/data.train.0.pt
[2020-03-28 14:54:20,428 INFO] * src vocab size = 347030
[2020-03-28 14:54:20,429 INFO] * tgt vocab size = 82235
[2020-03-28 14:54:20,429 INFO] Building model…
[2020-03-28 14:54:27,849 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(347030, 448, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(448, 64, num_layers=3, dropout=0.2)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(82235, 672, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.2, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.2, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(736, 64)
        (1): LSTMCell(64, 64)
        (2): LSTMCell(64, 64)
      )
    )
    (attn): GlobalAttention(
      (linear_out): Linear(in_features=128, out_features=64, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=64, out_features=82235, bias=True)
    (1): Cast()
    (2): LogSoftmax()
  )
)
[2020-03-28 14:54:27,850 INFO] encoder: 155667584
[2020-03-28 14:54:27,850 INFO] decoder: 60887259
[2020-03-28 14:54:27,850 INFO] * number of parameters: 216554843
[2020-03-28 14:54:27,853 INFO] Starting training on GPU: [0, 1]
[2020-03-28 14:54:27,853 INFO] Start training loop and validate every 2545 steps…
[2020-03-28 14:54:30,868 INFO] number of examples: 325817
[2020-03-28 14:55:12,201 INFO] Step 100/254544; acc: 6.32; ppl: 6125.12; xent: 8.72; lr: 0.50000; 40350/8076 tok/s; 44 sec
[2020-03-28 14:55:44,798 INFO] Step 200/254544; acc: 7.54; ppl: 2023.98; xent: 7.61; lr: 0.50000; 57013/10825 tok/s; 77 sec
[2020-03-28 14:56:17,312 INFO] Step 300/254544; acc: 7.75; ppl: 1587.35; xent: 7.37; lr: 0.50000; 54418/11052 tok/s; 109 sec
[2020-03-28 14:56:49,802 INFO] Step 400/254544; acc: 10.20; ppl: 960.99; xent: 6.87; lr: 0.50000; 55627/10926 tok/s; 142 sec
[2020-03-28 14:57:23,729 INFO] Step 500/254544; acc: 11.84; ppl: 755.28; xent: 6.63; lr: 0.50000; 51044/10705 tok/s; 176 sec
[2020-03-28 14:57:59,734 INFO] Step 600/254544; acc: 14.12; ppl: 593.81; xent: 6.39; lr: 0.50000; 51412/9365 tok/s; 212 sec
[2020-03-28 14:58:32,040 INFO] Step 700/254544; acc: 14.59; ppl: 481.08; xent: 6.18; lr: 0.50000; 50973/10727 tok/s; 244 sec
[2020-03-28 14:59:04,086 INFO] Step 800/254544; acc: 15.33; ppl: 448.16; xent: 6.11; lr: 0.50000; 51477/10625 tok/s; 276 sec
[2020-03-28 14:59:36,171 INFO] Step 900/254544; acc: 15.41; ppl: 418.66; xent: 6.04; lr: 0.50000; 52000/11017 tok/s; 308 sec
[2020-03-28 15:00:08,246 INFO] Step 1000/254544; acc: 16.65; ppl: 372.11; xent: 5.92; lr: 0.50000; 54705/10911 tok/s; 340 sec
[2020-03-28 15:00:39,409 INFO] Step 1100/254544; acc: 17.21; ppl: 357.27; xent: 5.88; lr: 0.50000; 55356/11226 tok/s; 372 sec
[2020-03-28 15:01:14,593 INFO] Step 1200/254544; acc: 17.09; ppl: 336.30; xent: 5.82; lr: 0.50000; 54644/10244 tok/s; 407 sec
[2020-03-28 15:01:25,220 INFO] Loading dataset from data/data.train.0.pt
[2020-03-28 15:01:43,953 INFO] number of examples: 325817
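For what it's worth, the freeze coincides exactly with the first reload of the training shard. With 325,817 examples, a batch size of 128, and batches split across the 2 GPUs, one pass over the data is about 325817 / 128 ≈ 2545 batches, i.e. roughly 2545 / 2 ≈ 1272 training steps. That matches the hang showing up right after step 1200: the last printed step is 1200, the shard is reloaded, and nothing is printed afterwards. (This is only an estimate, assuming batches are spread evenly over the two GPUs.)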
My parameters are:
python train.py -batch_size 128 \
    -accum_count 1 \
    -report_every 100 \
    -layers 3 \
    -world_size 2 \
    -gpu_ranks 0 1 \
    -rnn_size 64 \
    -data data/data \
    -pre_word_vecs_enc "data/embeddings.enc.pt" \
    -pre_word_vecs_dec "data/embeddings.dec.pt" \
    -src_word_vec_size 448 \
    -tgt_word_vec_size 672 \
    -fix_word_vecs_enc \
    -fix_word_vecs_dec \
    -save_model data/model_3_layer_ \
    -save_checkpoint_steps 2000 \
    -train_steps 254544 \
    -model_type text \
    -encoder_type rnn \
    -decoder_type rnn \
    -rnn_type LSTM \
    -global_attention dot \
    -global_attention_function softmax \
    -early_stopping 10 \
    -optim sgd \
    -learning_rate 0.5 \
    -valid_steps 2545 \
    -dropout .2 \
    -attention_dropout .3
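In case it helps with diagnosing this: a minimal sketch for finding out where the processes are stuck, using only the Python standard library (this is generic debugging code, not an OpenNMT-py feature). Adding it near the top of train.py makes each process dump its thread stacks on demand:

import faulthandler
import signal

# Dump the Python stack of all threads to stderr when the process
# receives SIGUSR1. After the log stops advancing, run
# `kill -USR1 <pid>` from another shell for each training process.
# If the traces show the workers blocked on a queue get/put around
# the dataset reload, the hang is in the data-loading handoff
# rather than inside CUDA.
faulthandler.register(signal.SIGUSR1, all_threads=True)

With -world_size 2 there are two training processes, so the signal has to be sent to both PIDs to see where each one is waiting.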