Training stops after creating first checkpoint


(karimkhan) #1

I started training a couple of times; once I also tried to continue from a previous checkpoint. But training stops at this stage -

Train command -

nohup python train.py -data data/demo -save_model demo-model -batch_size 256 -train_steps 10000 -save_checkpoint_steps 10000 -gpu_ranks 0 -world_size 1 -keep_checkpoint 5 &

[2018-10-23 12:26:52,515 INFO] Step 9950/10000; acc: 76.63; ppl: 2.88; xent: 1.06; lr: 1.00000; 5383/4866 tok/s; 10958 sec
[2018-10-23 12:27:47,610 INFO] Step 10000/10000; acc: 90.34; ppl: 1.38; xent: 0.32; lr: 1.00000; 1331/2068 tok/s; 11013 sec
[2018-10-23 12:27:47,769 INFO] Loading valid dataset from data/demo.valid.0.pt, number of examples: 9411
[2018-10-23 12:28:21,403 INFO] Validation perplexity: 29.3513
[2018-10-23 12:28:21,403 INFO] Validation accuracy: 52.7081
[2018-10-23 12:28:21,403 INFO] Saving checkpoint demo-model_step_10000.pt
[2018-10-23 12:28:22,712 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012
[2018-10-23 13:36:06,938 INFO] Loading checkpoint from demo-model_step_10000.pt
[2018-10-23 13:36:07,730 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012

Then I restarted training from the previous checkpoint with this command -

nohup python train.py -data data/demo -save_model demo-model -train_from demo-model_step_100000.pt -world_size 1 -gpu_ranks 0 -batch_size 256 -train_steps 10000 -keep_checkpoint 5 &

[2018-10-23 13:45:56,161 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012
[2018-10-23 13:45:56,236 INFO] * vocabulary size. source = 45202; target = 50004
[2018-10-23 13:45:56,236 INFO] Building model…
[2018-10-23 13:45:59,831 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(45202, 500, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(500, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50004, 500, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3)
      (layers): ModuleList(
        (0): LSTMCell(1000, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
      (linear_out): Linear(in_features=1000, out_features=500, bias=False)
    )
  )
  (generator): Sequential(
    (0): Linear(in_features=500, out_features=50004, bias=True)
    (1): LogSoftmax()
  )
)
[2018-10-23 13:45:59,832 INFO] encoder: 26609000
[2018-10-23 13:45:59,832 INFO] decoder: 55812004
[2018-10-23 13:45:59,832 INFO] * number of parameters: 82421004
[2018-10-23 13:45:59,833 INFO] Start training…
[2018-10-23 13:46:00,517 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012

Am I doing anything wrong?


(Vincent Nguyen) #2

Your -train_from should point to demo-model_step_10000.pt, not ..._100000.pt, though maybe that is just a typo in your post.

More importantly, you need to set -train_steps to a number greater than 10000. Training does not restart from step 0; it resumes from the step stored in the checkpoint, so with -train_steps 10000 it reaches the step limit immediately and exits.
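For example, a resume command along these lines should continue from step 10000 and run 10000 more steps (the exact step target of 20000 here is just illustrative, and I kept your other flags as-is):

```shell
# Resume from the step-10000 checkpoint and train up to step 20000.
# Note: -train_steps is an absolute total step count, not a number of
# additional steps, so it must exceed the step the checkpoint was saved at.
nohup python train.py -data data/demo -save_model demo-model \
  -train_from demo-model_step_10000.pt \
  -world_size 1 -gpu_ranks 0 -batch_size 256 \
  -train_steps 20000 -save_checkpoint_steps 10000 -keep_checkpoint 5 &
```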