Hello! I’m currently trying to train a Spanish-to-English model with OpenNMT-py, configured through YAML scripts. My dataset is pretty big, but just for starters I’m trying to get a 10,000-sentence training set and a 1,000-2,000-sentence validation set working well first. However, after trying for days, I think I need help, considering that my validation accuracy goes down the more I train while my training accuracy goes up.
My data comes from the ES-EN coronavirus commentary dataset from ModelFront, found here: https://console.modelfront.com/#/evaluations/5e86e34597c1790017d4050a. I found the parallel sentences to be pretty accurate. I’m using the first 10,000 parallel lines from the dataset, skipping any sentence that contains a digit. I then take the next 1,000-2,000 lines for my validation set and the next 1,000 for my test set, again keeping only sentences without digits. On inspection the data looks clean, and the source and target sentences are aligned line by line.
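In case it’s relevant, this is essentially how I filter and split the data. This is just a sketch: es.txt and en.txt stand in for the raw files downloaded from ModelFront, and the output names match the paths in my config below.

# Keep the two sides paired, then drop any pair where either sentence contains a digit
paste es.txt en.txt | grep -v '[0-9]' > filtered.tsv
# First 10,000 pairs for training, next 2,000 for validation, next 1,000 for test
head -n 10000 filtered.tsv | cut -f1 > data/spanish_train
head -n 10000 filtered.tsv | cut -f2 > data/english_train
sed -n '10001,12000p' filtered.tsv | cut -f1 > data/spanish_valid
sed -n '10001,12000p' filtered.tsv | cut -f2 > data/english_valid
sed -n '12001,13000p' filtered.tsv | cut -f1 > data/spanish_test
sed -n '12001,13000p' filtered.tsv | cut -f2 > data/english_test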
I then use SentencePiece to build a vocabulary model. With the spm_train command, I feed in my Spanish and English training files, comma-separated in the --input argument, and output a single esen.model. I chose the unigram model type and a vocab size of 16,000 (the exact command is shown below).
As for my YAML configuration file, here is what I specify:
My source and target training data (the 10,000 lines I extracted for Spanish and English, with “sentencepiece” in the transforms list)
My source and target validation data (2,000 lines for Spanish and English, with “sentencepiece” in the transforms list)
My vocab model esen.model as both the source and target subword model
Encoder: rnn
Decoder: rnn
Type: LSTM
Layers: 2
bidir: true
Optim: Adam
Learning rate: 0.001
Training steps: 5000
Valid steps: 1000
Other logging options.
Upon starting the training with onmt_train, my training accuracy starts off at 7.65 and climbs into the low 70s by the time the 5,000 steps are over. But over that same span, my validation accuracy drops from 24 to 19.
I then use BLEU to score my test set, which gets a brevity penalty (BP) of ~0.67.
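Since BLEU’s brevity penalty is BP = exp(1 - r/c) when the candidate length c is below the reference length r, a BP of 0.67 means 1 - r/c = ln(0.67) ≈ -0.40, i.e. my outputs average only about 71% of the reference length. For what it’s worth, this is roughly how I score: a sketch using sacreBLEU as the scorer, where model_step_5000.pt stands for whatever checkpoint the run saved last, and depending on the OpenNMT-py version I also have to SentencePiece-encode the source and decode the pieces in pred.txt before scoring.

!onmt_translate -model drive/MyDrive/ESEN/model3_bpe_adam_001_layer2/model_step_5000.pt -src data/spanish_test -output pred.txt -gpu 0
!sacrebleu data/english_test -i pred.txt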
I noticed that after trying SGD with a learning rate of 1, my validation accuracy kept increasing, but the validation perplexity started going back up at the end.
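For that run, the only change to the optimizer block of the config below was:

optim: sgd
learning_rate: 1.0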
I’m wondering if I’m doing anything wrong that would make my validation accuracy go down while my training accuracy goes up? Do I just need to train more? Can anyone recommend anything else to improve this model? I’ve been staring at it for a few days. Anything is appreciated. Thanks.
!spm_train --input=data/spanish_train,data/english_train --model_prefix=data/esen --character_coverage=1 --vocab_size=16000 --model_type=unigram
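To sanity-check the subword model, I encode a few training lines with spm_encode (which ships with SentencePiece):

!spm_encode --model=data/esen.model --output_format=piece < data/spanish_train | head -n 3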
## Where the samples will be written
save_data: en-sp/run/example
## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt
## Where the model will be saved
save_model: drive/MyDrive/ESEN/model3_bpe_adam_001_layer2/model
# Prevent overwriting existing files in the folder
overwrite: False
# Corpus opts:
data:
    taus_corona:
        path_src: data/spanish_train
        path_tgt: data/english_train
        transforms: [sentencepiece, filtertoolong]
        weight: 1
    valid:
        path_src: data/spanish_valid
        path_tgt: data/english_valid
        transforms: [sentencepiece]
skip_empty_level: silent
src_subword_model: data/esen.model
tgt_subword_model: data/esen.model
# General opts
report_every: 100
train_steps: 5000
valid_steps: 1000
save_checkpoint_steps: 1000
world_size: 1
gpu_ranks: [0]
# Optimizer
optim: adam
learning_rate: 0.001
# Model
encoder_type: rnn
decoder_type: rnn
layers: 2
rnn_type: LSTM
bidir_edges: True
global_attention: general
global_attention_function: softmax
# Logging
tensorboard: true
tensorboard_log_dir: logs
log_file: logs/log-file.txt
verbose: True
attn_debug: True
align_debug: True
onmt_build_vocab -config en-sp.yaml -n_sample -1
onmt_train -config en-sp.yaml
Step 1000/ 5000; acc: 27.94; ppl: 71.88; xent: 4.27; lr: 0.00100; 13103/12039 tok/s; 157 sec
Validation perplexity: 136.446
Validation accuracy: 24.234
...
Step 4000/ 5000; acc: 61.25; ppl: 5.28; xent: 1.66; lr: 0.00100; 13584/12214 tok/s; 641 sec
Validation accuracy: 22.1157
...