Hello! I’m trying to train a Spanish-to-English model using YAML configuration files. My dataset is fairly large, but to start with I’m trying to get a 10,000-sentence training set and a 1,000 to 2,000-sentence validation set working well. After several days of trying, I think I need help: my validation accuracy goes down the more I train, while my training accuracy goes up.
My data comes from the ES-EN coronavirus commentary dataset from ModelFront, found here: https://console.modelfront.com/#/evaluations/5e86e34597c1790017d4050a. I found the parallel sentences to be pretty accurate. I’m using the first 10,000 parallel lines from the dataset, skipping sentences that contain any digits. I then take the next 1,000 or 2,000 digit-free sentences for my validation set and the next 1,000 for my test set. Looking at the data, it appears clean, and the source and target sentences are aligned line by line.
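The split described above can be sketched roughly as follows. This is only an illustration of the procedure, not the exact script used: the function names `digit_free_pairs` and `split_corpus` are hypothetical, and the toy sentences stand in for the real corpus files.

```python
import re
from itertools import islice

_DIGIT = re.compile(r"\d")

def digit_free_pairs(src_lines, tgt_lines):
    """Yield aligned (source, target) pairs in which neither side contains a digit."""
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not _DIGIT.search(src) and not _DIGIT.search(tgt):
            yield src, tgt

def split_corpus(src_lines, tgt_lines, n_train=10_000, n_valid=2_000, n_test=1_000):
    """Take consecutive, non-overlapping train/valid/test slices of the filtered pairs."""
    pairs = digit_free_pairs(src_lines, tgt_lines)
    train = list(islice(pairs, n_train))  # first n_train digit-free pairs
    valid = list(islice(pairs, n_valid))  # the next n_valid pairs
    test = list(islice(pairs, n_test))    # the next n_test pairs
    return train, valid, test

# Toy data (made up for illustration) with one pair that contains a digit.
src = ["hola mundo", "tengo 2 gatos", "buenos días", "adiós"]
tgt = ["hello world", "I have 2 cats", "good morning", "goodbye"]
train, valid, test = split_corpus(src, tgt, n_train=1, n_valid=1, n_test=1)
print(train, valid, test)
```

Because the three slices are taken from one generator, the sets cannot overlap, and a pair is dropped whenever either side contains a digit, so the two files stay aligned.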
I then use SentencePiece to build a subword model. Using the spm_train command, I pass in my English and Spanish training sets, comma-separated in the input argument, and output a single esen.model. I chose the unigram model type and a vocab size of 16000.
As for my YAML configuration file, here is what I specify:
My source and target training data (the 10,000 lines I extracted for English and Spanish, with “sentencepiece” in the transforms)
My source and target validation data (2,000 lines for English and Spanish, with “sentencepiece” in the transforms)
My subword model esen.model for both the source and the target side
Learning rate: 0.001
Training steps: 5000
Valid steps: 1000
Other logging data.
Upon starting the training with onmt_train, my training accuracy starts off at 7.65 and reaches the low 70s by the time the 5000 steps are over. But over that same period, my validation accuracy goes from 24 down to 19.
I then use BLEU to score my test set, which gets a BP of ~0.67.
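If "BP" here is the standard BLEU brevity penalty (an assumption, since the scoring tool is not named), the value can be turned back into a length ratio. The sketch below uses the usual definition, BP = exp(1 - r/c) when the hypothesis length c is shorter than the reference length r; the function names are hypothetical.

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1.0 if the hypothesis is at least as long
    as the reference, otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len <= 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

def length_ratio_from_bp(bp):
    """Invert BP = exp(1 - r/c) to get c/r, the hypothesis/reference length ratio.
    Only meaningful for 0 < bp < 1; bp == 1 only tells us the ratio is >= 1."""
    if not 0.0 < bp <= 1.0:
        raise ValueError("brevity penalty must be in (0, 1]")
    return 1.0 / (1.0 - math.log(bp))

ratio = length_ratio_from_bp(0.67)
print(f"{ratio:.2f}")  # 0.71
```

Under that assumption, a BP of about 0.67 would mean the translations are only around 71% as long as the references, i.e. the model is producing noticeably short output.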
I also noticed that when I tried SGD with a learning rate of 1, my validation accuracy kept increasing, but the perplexity started going back up toward the end.
I’m wondering if I’m doing anything wrong that would make my validation accuracy go down while my training accuracy goes up? Do I just need to train more? Can anyone recommend anything else to improve this model? I’ve been staring at it for a few days. Anything is appreciated. Thanks.
!spm_train --input=data/spanish_train,data/english_train --model_prefix=data/esen --character_coverage=1 --vocab_size=16000 --model_type=unigram
## Where the samples will be written
save_data: en-sp/run/example
## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt
## Where the model will be saved
save_model: drive/MyDrive/ESEN/model3_bpe_adam_001_layer2/model
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    taus_corona:
        path_src: data/spanish_train
        path_tgt: data/english_train
        transforms: [sentencepiece, filtertoolong]
        weight: 1
    valid:
        path_src: data/spanish_valid
        path_tgt: data/english_valid
        transforms: [sentencepiece]
skip_empty_level: silent
src_subword_model: data/esen.model
tgt_subword_model: data/esen.model

# General opts
report_every: 100
train_steps: 5000
valid_steps: 1000
save_checkpoint_steps: 1000
world_size: 1
gpu_ranks:

# Optimizer
optim: adam
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
layers: 2
rnn_type: LSTM
bidir_edges: True

# Logging
tensorboard: true
tensorboard_log_dir: logs
log_file: logs/log-file.txt
verbose: True
attn_debug: True
align_debug: True
global_attention: general
global_attention_function: softmax
onmt_build_vocab -config en-sp.yaml -n_sample -1
onmt_train -config en-sp.yaml
Step 1000/ 5000; acc: 27.94; ppl: 71.88; xent: 4.27; lr: 0.00100; 13103/12039 tok/s; 157 sec
Validation perplexity: 136.446
Validation accuracy: 24.234
...
Step 4000/ 5000; acc: 61.25; ppl: 5.28; xent: 1.66; lr: 0.00100; 13584/12214 tok/s; 641 sec
Validation accuracy: 22.1157
...
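To make the train/validation gap easier to see, the log can be parsed into (step, training accuracy, validation accuracy) rows. This is a rough sketch: the regular expressions assume the log format shown in the excerpt above, the helper name `parse_log` is hypothetical, and the sample text is just the two excerpts from this post.

```python
import re

# Patterns assume the log line format shown in the excerpt above.
TRAIN_RE = re.compile(r"Step\s+(\d+)/\s*\d+;\s*acc:\s*(\d+(?:\.\d+)?)")
VALID_ACC_RE = re.compile(r"Validation accuracy:\s*(\d+(?:\.\d+)?)")

def parse_log(text):
    """Pair each 'Validation accuracy' value with the most recent training
    step report that appears before it in the log text."""
    events = []
    for m in TRAIN_RE.finditer(text):
        events.append((m.start(), "train", (int(m.group(1)), float(m.group(2)))))
    for m in VALID_ACC_RE.finditer(text):
        events.append((m.start(), "valid", float(m.group(1))))
    events.sort(key=lambda e: e[0])  # restore the order in which they were logged

    rows, last_train = [], None
    for _, kind, value in events:
        if kind == "train":
            last_train = value
        elif last_train is not None:
            step, train_acc = last_train
            rows.append({"step": step, "train_acc": train_acc, "valid_acc": value})
    return rows

sample_log = (
    "Step 1000/ 5000; acc: 27.94; ppl: 71.88; xent: 4.27; lr: 0.00100; 13103/12039 tok/s; 157 sec "
    "Validation perplexity: 136.446 Validation accuracy: 24.234 ... "
    "Step 4000/ 5000; acc: 61.25; ppl: 5.28; xent: 1.66; lr: 0.00100; 13584/12214 tok/s; 641 sec "
    "Validation accuracy: 22.1157 ..."
)

rows = parse_log(sample_log)
for row in rows:
    gap = row["train_acc"] - row["valid_acc"]
    print(f"step {row['step']}: train {row['train_acc']:.2f}, valid {row['valid_acc']:.2f}, gap {gap:.2f}")

# The checkpoint with the highest validation accuracy among the parsed steps.
best = max(rows, key=lambda r: r["valid_acc"])
print("best validation step:", best["step"])
```

On the two excerpts above, the gap between training and validation accuracy grows from roughly 4 points at step 1000 to roughly 39 points at step 4000, and the step-1000 checkpoint has the better validation accuracy.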