Help with Spanish to English Model

Hello! I’m currently trying to train a Spanish-to-English model with OpenNMT-py using YAML configs. My data set is fairly large, but just for starters I’m trying to get a 10,000-sentence training set and a 1,000-2,000-sentence validation set working well. After trying for days, I think I need help, because my validation accuracy goes down the more I train while my training accuracy goes up.

My data comes from the ES-EN coronavirus commentary data set from ModelFront, found here: https://console.modelfront.com/#/evaluations/5e86e34597c1790017d4050a. The parallel sentences look pretty accurate. I’m using the first 10,000 parallel lines from the dataset, skipping any sentence that contains digits. I then take the next 1,000-2,000 lines for my validation set and the next 1,000 for my test set, again keeping only sentences without numbers. On inspection the data looks clean, and the source and target sentences line up with each other on their respective lines.
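For concreteness, here is a rough sketch of that split (the corona.es/corona.en file names are placeholders for the raw ModelFront download; the data/ paths are the ones my config uses):

# Sketch of the split described above: drop any pair containing a digit,
# then take train / valid / test slices in order.

def load_pairs(src_path, tgt_path):
    with open(src_path, encoding="utf-8") as f_src, \
         open(tgt_path, encoding="utf-8") as f_tgt:
        for src, tgt in zip(f_src, f_tgt):
            src, tgt = src.strip(), tgt.strip()
            # skip the pair if either side contains a digit
            if any(c.isdigit() for c in src + tgt):
                continue
            yield src, tgt

def write_split(pairs, src_path, tgt_path):
    with open(src_path, "w", encoding="utf-8") as f_src, \
         open(tgt_path, "w", encoding="utf-8") as f_tgt:
        for src, tgt in pairs:
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")

pairs = list(load_pairs("corona.es", "corona.en"))
write_split(pairs[:10000], "data/spanish_train", "data/english_train")
write_split(pairs[10000:12000], "data/spanish_valid", "data/english_valid")
write_split(pairs[12000:13000], "data/spanish_test", "data/english_test")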

I then use SentencePiece to build a vocabulary model. Using the spm_train command, I feed in my English and Spanish training sets, comma-separated in the --input argument, and output a single esen.model. I chose the unigram model type and a vocab size of 16,000.
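As a quick sanity check (separate from the training pipeline), the finished model can be loaded with the SentencePiece Python API and used to tokenize a sample sentence:

import sentencepiece as spm

# Load the shared Spanish/English subword model trained by spm_train below
sp = spm.SentencePieceProcessor(model_file="data/esen.model")

pieces = sp.encode("La pandemia cambió nuestras vidas.", out_type=str)
print(pieces)             # subword pieces such as ['▁La', '▁pandemia', ...]
print(sp.decode(pieces))  # round-trips back to the original sentence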

As for my YAML configuration file, here is what I specify:

- My source and target training data (the 10,000 lines I extracted for English and Spanish, with "sentencepiece" in the transforms [])
- My source and target validation data (2,000 lines for English and Spanish, with "sentencepiece" in the transforms [])
- My vocab model esen.model as both the source and target subword model
- Encoder: rnn
- Decoder: rnn
- Type: LSTM
- Layers: 2
- bidir: true
- Optim: Adam
- Learning rate: 0.001
- Training steps: 5000
- Valid steps: 1000
- Other logging options

Upon starting the training with onmt_train, my training accuracy starts off at 7.65 and climbs into the low 70s by the time the 5,000 steps are over. But in that same window, my validation accuracy goes from 24 down to 19.

I then use BLEU to score my test set, which gets a brevity penalty (BP) of ~0.67, meaning my outputs are noticeably shorter than the references.
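One way to double-check that number is with sacreBLEU (a sketch, not necessarily the exact tool I used; test.hyp.en is a placeholder for my detokenized model output):

import sacrebleu

# Hypotheses and references, one detokenized sentence per line
with open("test.hyp.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("data/english_test", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print("BLEU:", bleu.score)
print("BP:", bleu.bp)  # < 1.0 means the hypotheses are shorter than the references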

I noticed that after trying SGD with a learning rate of 1, my validation accuracy kept increasing, but the validation perplexity started going back up toward the end.

Is there anything I’m doing wrong that would make my validation accuracy go down while my training accuracy goes up? Do I just need to train more? Can anyone recommend anything else to improve this model? I’ve been staring at it for a few days; anything is appreciated. Thanks.

!spm_train --input=data/spanish_train,data/english_train --model_prefix=data/esen --character_coverage=1 --vocab_size=16000 --model_type=unigram

## Where the samples will be written
save_data: en-sp/run/example

## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt

## Where the model will be saved
save_model: drive/MyDrive/ESEN/model3_bpe_adam_001_layer2/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    taus_corona:
        path_src: data/spanish_train
        path_tgt: data/english_train
        transforms: [sentencepiece, filtertoolong]
        weight: 1

    valid:
        path_src: data/spanish_valid
        path_tgt: data/english_valid
        transforms: [sentencepiece]

skip_empty_level: silent
src_subword_model: data/esen.model
tgt_subword_model: data/esen.model


# General opts
report_every: 100
train_steps: 5000
valid_steps: 1000
save_checkpoint_steps: 1000
world_size: 1
gpu_ranks: [0]

# Optimizer
optim: adam
learning_rate: 0.001

# Model
encoder_type: rnn
decoder_type: rnn
layers: 2
rnn_type: LSTM
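# Note: bidir_edges is an option for the ggnn graph encoder; a
# bidirectional LSTM encoder is selected with encoder_type: brnn instead.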
bidir_edges: True


# Logging
tensorboard: true
tensorboard_log_dir: logs
log_file: logs/log-file.txt
verbose: True
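# Note: attn_debug and align_debug are onmt_translate options,
# so they have no effect during training.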
attn_debug: True
align_debug: True
global_attention: general
global_attention_function: softmax

onmt_build_vocab -config en-sp.yaml -n_sample -1

onmt_train -config en-sp.yaml

Step 1000/ 5000; acc:  27.94; ppl: 71.88; xent: 4.27; lr: 0.00100; 13103/12039 tok/s;    157 sec
Validation perplexity: 136.446
Validation accuracy: 24.234

...

Step 4000/ 5000; acc:  61.25; ppl:  5.28; xent: 1.66; lr: 0.00100; 13584/12214 tok/s;    641 sec
Validation accuracy: 22.1157

...

When your training metric improves while your validation metric degrades, you are probably overfitting the model. In other words, the model is starting to memorize the training data while losing its capacity to generalize to new data. You should stop the training when this starts to happen. There are several things you could do to mitigate or avoid this problem (more dropout, early stopping, a smaller model), but the easiest would be to add more training data.
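For example, "stop when this starts to happen" can be as simple as a patience rule over the validation scores. Here is a generic sketch (not OpenNMT-py code; if I remember correctly, OpenNMT-py also exposes this directly through its early_stopping option):

# Stop once the validation metric has not improved for `patience`
# consecutive validation steps.

def should_stop(history, patience=3):
    """history: validation accuracies, oldest first."""
    if len(history) <= patience:
        return False
    best = max(history[:-patience])
    # stop if none of the last `patience` scores beat the earlier best
    return all(score <= best for score in history[-patience:])

# With numbers shaped like the ones in this thread:
print(should_stop([24.2, 23.8, 22.1, 21.5, 19.0], patience=3))  # True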


Cool, yeah. I’m going to increase my dataset to 100k training lines, add some more aggressive LSTM dropout, and train for more iterations. I’ll see what happens. Thanks!

src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt

Are these files the same but with different names?

Different tags at the end. I got this setup from the tutorial (Quickstart).

My understanding is that you are using a shared vocab; if the two files have the same content under different names, then it’s fine.
You can compare the files with the "sort" and "uniq" commands on Linux.
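Or as a quick Python check (assuming the vocab files hold one token per line, optionally followed by a count):

# Compare the token sets of the two vocab files.

def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

src = load_vocab("en-sp/run/example.vocab.src")
tgt = load_vocab("en-sp/run/example.vocab.tgt")

print("same token set:", src == tgt)
print("only in src:", len(src - tgt), "| only in tgt:", len(tgt - src))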

Ah, okay. Thanks, I’ll look into those commands.