Overfitting Model

Hello everyone,

I’m trying to train an LSTM model on a French–Wolof corpus (Wolof is a morphologically rich language spoken in Senegal, West Africa), split into 35k/35k/30k sentences for Train/Dev/Test. Training goes quite well (BLEU score > 30) when I apply BPE tokenization (with a SentencePiece model), but when I remove it and use whitespace tokenization (or train on the raw data), the BLEU score drops drastically (close to 0).
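
For context, here is a rough sketch of the BPE step with the sentencepiece Python package (file names and training options are illustrative, with vocab_size set to 10000 to match the config below):

import sentencepiece as spm

# Train a BPE model on the raw French training text (placeholder path).
spm.SentencePieceTrainer.train(
    input="raw/src_french.txt",   # placeholder, not necessarily the real file
    model_prefix="source",        # writes source.model and source.vocab
    vocab_size=10000,             # matches src_vocab_size in the config below
    model_type="bpe",
)

# Apply the trained model to segment a sentence into BPE pieces.
sp = spm.SentencePieceProcessor(model_file="source.model")
pieces = sp.encode("Bonjour tout le monde", out_type=str)
print(pieces)              # subword pieces, e.g. ['▁Bonjour', '▁tout', '▁le', '▁monde']
print(sp.decode(pieces))   # decodes back to the original sentence
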
I noticed that the model starts to overfit after about 2k steps: the validation perplexity starts to increase from that point and keeps increasing. Here is the configuration I used:

save_data: run

# Training files
data:
    corpus_1:
        path_src: tokenized/src_french.csv.train
        path_tgt: tokenized/tgt_wolof.csv.train
        transforms: [filtertoolong]
    valid:
        path_src: tokenized/src_french.csv.dev
        path_tgt: tokenized/tgt_wolof.csv.dev
        transforms: [filtertoolong]

# Vocabulary files, generated by onmt_build_vocab
src_vocab: vocabularies/source.vocab
tgt_vocab: vocabularies/target.vocab

# Vocabulary size
src_vocab_size: 10000
tgt_vocab_size: 10000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 200
tgt_seq_length: 200

# Tokenization options
#src_subword_model: source.model
#tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: log/train.log
save_model: models/model.frwo

# Stop training if it does not improve after n validations
early_stopping: 8

# Default: 5000 - Save a model checkpoint for each n
save_checkpoint_steps: 1000

# To save space, limit checkpoints to last n
# keep_checkpoint: 3

seed: 19

# Default: 100000 - Train the model to max n steps 
# Increase for large datasets
train_steps: 10000

# Default: 10000 - Run validation after n steps
valid_steps: 2000

# Default: 4000 - for large datasets, try up to 8000
report_every: 100

decoder_type: rnn
encoder_type: rnn
word_vec_size: 128
rnn_size: 300
enc_layers: 1
dec_layers: 1
rnn_type: LSTM

optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.00

# Tokens per batch, change if out of GPU memory
batch_size: 4096
valid_batch_size: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

param_init: 0.0
#param_init_glorot: 'true'

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]
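
A rough way to track where the validation perplexity turns around is to scan the training log (a sketch that assumes the log contains lines like “Validation perplexity: …”, written by OpenNMT-py to log/train.log; adjust the pattern if your version logs differently):

import re

# Collect validation perplexity values from the OpenNMT-py training log.
# Assumes lines of the form "... Validation perplexity: 12.34".
ppls = []
with open("log/train.log", encoding="utf-8") as log:
    for line in log:
        match = re.search(r"Validation perplexity: ([0-9.]+)", line)
        if match:
            ppls.append(float(match.group(1)))

# With valid_steps: 2000, the i-th validation corresponds to step i * 2000.
for i, ppl in enumerate(ppls, start=1):
    print(f"step {i * 2000}: validation perplexity {ppl:.2f}")
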

Hello,

If the training with BPE works well, why do you want to remove BPE?

More generally, you don’t have a lot of training data. If you can’t gather more data, I suggest restricting the dev and test sets to no more than 1k sentences and moving everything else to the training data.
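
For example, a re-split along those lines might look like this (just a sketch; the file names and the exact 1k sizes are placeholders, and it assumes the source and target files are line-aligned):

import random

# Re-split a line-aligned parallel corpus: keep 1k sentence pairs each for dev
# and test, move everything else to training. File names are placeholders.
with open("all.fr", encoding="utf-8") as f_src, open("all.wo", encoding="utf-8") as f_tgt:
    pairs = list(zip(f_src, f_tgt))

random.Random(19).shuffle(pairs)  # fixed seed so the split is reproducible
dev, test, train = pairs[:1000], pairs[1000:2000], pairs[2000:]

for name, split in [("train", train), ("dev", dev), ("test", test)]:
    with open(f"src_french.{name}", "w", encoding="utf-8") as out_src, \
         open(f"tgt_wolof.{name}", "w", encoding="utf-8") as out_tgt:
        for src_line, tgt_line in split:
            out_src.write(src_line)
            out_tgt.write(tgt_line)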


Hello @guillaumekln, thanks for your answer!

I wanted to make a comparison with this paper, which trained on whitespace-tokenized data, in order to evaluate the contribution of BPE to model performance. I tried to use the same train/dev/test configuration, which is why I split my dataset this way, but I will take your suggestion into account afterwards.
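
In case it is useful, the BLEU comparison itself can be computed the same way in both setups with sacrebleu on detokenized output (a sketch with placeholder file names; for the BPE run the hypotheses would first be decoded back to plain text):

import sacrebleu

# Score detokenized output so the BPE and whitespace runs are compared on the
# same footing. File names are placeholders.
with open("hyp.detok.wo", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.detok.wo", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")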