Accuracy increases, stabilises for around 10k steps and then drops drastically

I am training a 2-layer LSTM seq2seq model (en to en) with a character-level encoder and a word-level decoder with pre-trained embeddings.

I am working with ~1 million lines of data, which I am training for 200k steps with the Adam optimiser (lr 0.001).

The training starts off pretty well and goes up to 92% accuracy, ppl 1.2, xent 0.2 at around step 10000/200000, and stays in that vicinity until the 21000th step. At that point the log shows another load from the train data, after which the accuracy gradually declines to a mere 25% and stays around that value.

I run validation every 5000 steps, and the validation scores at the 5000th and 10000th steps were both around 90%, with ppl 1.< > and xent lower than 0.5; these also came down to the low 20s after the 21k step.

Can anyone give me an insight into what might be happening? I wondered if it was because my data is not shuffled, but in that case the validation runs should have come back with a miserable score, as the validation set contains around 50k sentences of all sizes.

Edit: I ran translate on a later checkpoint and found that the model was repeatedly printing the same 10 tokens for all inputs. Can anyone help me understand why that suddenly started happening?

1> ▁the ▁reason ▁to ▁the ▁reason ▁of ▁the ▁world com ▁boss ▁at ▁the
2> ▁reason ▁of ▁the ▁world com ▁boss ▁.
3> ▁" ▁the ▁reason ▁of ▁the ▁world com ▁boss ▁- ▁ 1 ▁ 9 ▁ 8 ▁ 1 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 1 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 8 ▁ 7 ▁ 8 ▁ 5
4> ▁world ▁’ ▁s ▁reason ▁of ▁the ▁reason ▁of ▁the ▁world com ▁boss ▁.

Is there any known ordering in your training data? Also, posting your command lines will probably help.

Hi @guillaumekln. The dataset was shuffled before creating the training text files. Here are the steps I took. Note that this is now a character-level model: instead of a word-level decoder, I have hooked in a character-level one.

  1. Preprocessing

python3 …/…/OpenNMT-py/preprocess.py \
--train_src char_level_mask_src_train_0.925_threshold.txt \
--train_tgt char_level_mask_tgt_train_0.925_threshold.txt \
--valid_src char_level_mask_src_test_0.925_threshold.txt \
--valid_tgt char_level_mask_tgt_test_0.925_threshold.txt \
--save_data data_char_mask \
--src_seq_length 256 --tgt_seq_length 256 \
--report_every 10000 --log_file preprocess_char_mask_pretrained.log

  2. Using pre-trained embeddings

python3 …/…/OpenNMT-py/tools/embeddings_to_torch.py \
-emb_file_enc src_mask_embed.txt \
-emb_file_dec tgt_mask_embed.txt \
-dict_file data_char_mask.vocab.pt \
-output_file char_mask_pretrained_embedding
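
(For context, my understanding of what these embedding files end up doing is roughly the following; this is illustrative PyTorch only, not OpenNMT-py's actual loading code, and the exact contents of the .enc.pt/.dec.pt files are an assumption on my part.)

```python
# Illustrative sketch only: assume the *.enc.pt file holds a vocab-aligned
# FloatTensor of shape [vocab_size, emb_dim], and copy it into the encoder's
# embedding layer before training starts.
import torch
import torch.nn as nn

pretrained = torch.load("char_mask_pretrained_embedding.enc.pt")  # assumed: FloatTensor [vocab_size, emb_dim]
vocab_size, emb_dim = pretrained.shape

encoder_embeddings = nn.Embedding(vocab_size, emb_dim)
with torch.no_grad():
    encoder_embeddings.weight.copy_(pretrained)  # initialise from the pre-trained vectors
```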

  3. Training

python3 …/…/OpenNMT-py/train.py \
--src_word_vec_size 128 --tgt_word_vec_size 128 \
--encoder_type brnn --decoder_type rnn --enc_layers 2 \
--dec_layers 2 --enc_rnn_size 1024 --dec_rnn_size 1024 \
--rnn_type LSTM --context_gate both --global_attention general \
--data data_char_mask --save_model char_mask_models --world_size 2 \
--gpu_ranks 0 1 --gpu_verbose_level 0 \
--pre_word_vecs_enc char_mask_pretrained_embedding.enc.pt \
--pre_word_vecs_dec char_mask_pretrained_embedding.dec.pt \
--valid_steps 10000 --train_steps 1000000 --early_stopping 10 \
--optim adam --learning_rate 0.001 \
--log_file training_LSTM_2_1024_char_mask_model.log

  4. In the attached log file, you can see that the validation ppl and accuracy are also in a similar range, and then after a few steps the perplexity starts increasing.

I am using the Adam optimiser with an lr of 0.001.
P.S. The validation dataset is a balanced representation of all the kinds of data used to build the corpus.

Also, is this normal behaviour? I have just one shard of data, but there are subsequent loads from the .pt file (the batch size is 64, so 7.5k steps are not enough to cover one epoch over the entire ~1 million lines and trigger a reload). Moreover, there is a step between the load-dataset INFO line and the number-of-examples INFO line.

This behaviour is quite strange, not sure where it could come from.

As for the “early reloading” of your data, it's because of the pooling mechanism we use to rationalize the data-loading pipeline.
Basically, instead of taking just one batch at a time, we read batch_size * pool_factor examples (pool_factor defaults to 8192), order these by length (to get homogeneous batches), create batches, shuffle these batches, and yield them to the GPUs.
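
In pseudo-Python, the idea is roughly the following (a simplified sketch of the mechanism described above, not the actual iterator code):

```python
# Simplified sketch of the pooling mechanism: fill a pool of
# batch_size * pool_factor examples, sort it by length so batches are
# homogeneous, cut it into batches, shuffle the batches, and yield them.
import random

def pooled_batches(examples, batch_size=64, pool_factor=8192):
    pool = []
    for ex in examples:                      # `examples` is any iterable of token sequences
        pool.append(ex)
        if len(pool) == batch_size * pool_factor:
            yield from _make_batches(pool, batch_size)
            pool = []
    if pool:                                 # flush the last, partially filled pool
        yield from _make_batches(pool, batch_size)

def _make_batches(pool, batch_size):
    pool = sorted(pool, key=len)             # group examples of similar length together
    batches = [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
    random.shuffle(batches)                  # avoid yielding batches in length order
    yield from batches
```

Because a whole pool is consumed from the dataset at once, the dataset-load log line can show up earlier than the step count alone would suggest.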

Hi @francoishernandez,
I tried reshuffling my data as well as changing the split (to include more validation data points). But even after the shuffle, I encountered the same behaviour.

It happens when I keep my lr around 0.01 for Adam and in the 0.1-1 range for SGD. A smaller lr, like 1e-4 for Adam, doesn't result in this scenario.

But I cannot figure out what could be going wrong theoretically. If the lr is too big to settle into such a saddle point, the model should fluctuate rather than stagnate. Also, I don't see how (even if such points exist) a small set of difficult/peculiar sentences could throw my model so far off track that further training cannot improve it, even when the losses incurred are huge.
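
To look at the learning-rate sensitivity in isolation, a toy comparison along these lines can be run; the tiny LSTM and the synthetic copy task below are purely illustrative and are not my actual seq2seq setup:

```python
# Toy comparison of two Adam learning rates on a tiny 2-layer LSTM trained on
# a synthetic copy task (predict the input token at each position). Purely
# illustrative; sizes and task are made up for the sketch.
import torch
import torch.nn as nn

def run(lr, steps=500, vocab=50, hidden=128, seq_len=20, batch=64, seed=0):
    torch.manual_seed(seed)
    emb = nn.Embedding(vocab, hidden)
    lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
    proj = nn.Linear(hidden, vocab)
    params = list(emb.parameters()) + list(lstm.parameters()) + list(proj.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(1, steps + 1):
        x = torch.randint(0, vocab, (batch, seq_len))
        out, _ = lstm(emb(x))
        loss = loss_fn(proj(out).reshape(-1, vocab), x.reshape(-1))  # copy task
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 100 == 0:
            print(f"lr={lr}  step={step:4d}  loss={loss.item():.3f}")

for lr in (1e-2, 1e-4):
    run(lr)
```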

Is there something I'm doing wrong when calling the preprocessing or training scripts?

Hi @guillaumekln, could there be any probable explanation for this?

Can you try with a smaller learning rate (e.g. 0.0002)?

Hi @guillaumekln.

It worked fine with the lr range you advised, and the model converged.
Thanks for the help!
However, I am still not able to work out what might be going wrong with a seq2seq LSTM model at standard learning rates (0.01 for Adam and 1 for SGD).

Did you mean 0.001? I think this is a bit too high for RNN models. In OpenNMT-lua we recommend the value 0.0002.

So your training was simply diverging.
