Hello, I’m trying to fine-tune a tiny transformer translation model on a small dataset of very high-quality parallel sentences taken from FLORES-200 (I know this dataset is meant for evaluation only, but I need high-quality parallel sentences, and this model will not be used in production).
I have already trained the model on 200k sentences from Wikipedia, and now I’m trying to fine-tune it on 2k sentences, but the training keeps looping over these logs:
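For context, this is roughly how I build the parallel files from the FLORES-200 per-language files before subword tokenization (a minimal sketch; the paths and language codes below are just illustrative of my local layout, since FLORES ships one line-aligned file per language):

# Illustrative sketch: pair two line-aligned FLORES-200 files into src/tgt text files.
# The input file names are assumptions about my local layout.
src_in = "flores200_dataset/dev/fra_Latn.dev"   # French side (source)
tgt_in = "flores200_dataset/dev/eng_Latn.dev"   # English side (target)

with open(src_in, encoding="utf-8") as f_src, open(tgt_in, encoding="utf-8") as f_tgt, \
     open("temps/src-flores.txt", "w", encoding="utf-8") as out_src, \
     open("temps/tgt-flores.txt", "w", encoding="utf-8") as out_tgt:
    for src_line, tgt_line in zip(f_src, f_tgt):
        # FLORES files are aligned by line number, so line i in each file is a translation pair.
        out_src.write(src_line)
        out_tgt.write(tgt_line)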
[2023-07-06 21:59:25,144 INFO] Weighted corpora loaded so far:
* corpus_1: 65
[2023-07-06 21:59:25,145 INFO] Weighted corpora loaded so far:
* corpus_1: 64
[2023-07-06 21:59:25,149 INFO] Weighted corpora loaded so far:
* corpus_1: 66
[2023-07-06 21:59:25,151 INFO] Weighted corpora loaded so far:
* corpus_1: 65
[2023-07-06 21:59:25,155 INFO] Weighted corpora loaded so far:
* corpus_1: 67
[2023-07-06 21:59:25,157 INFO] Weighted corpora loaded so far:
* corpus_1: 66
[2023-07-06 21:59:25,163 INFO] Weighted corpora loaded so far:
* corpus_1: 68
[2023-07-06 21:59:25,163 INFO] Weighted corpora loaded so far:
* corpus_1: 67
[2023-07-06 21:59:25,169 INFO] Weighted corpora loaded so far:
* corpus_1: 69
[2023-07-06 21:59:25,171 INFO] Weighted corpora loaded so far:
* corpus_1: 68
[2023-07-06 21:59:25,174 INFO] Weighted corpora loaded so far:
* corpus_1: 70
[2023-07-06 21:59:25,177 INFO] Weighted corpora loaded so far:
* corpus_1: 69
[2023-07-06 21:59:25,180 INFO] Weighted corpora loaded so far:
* corpus_1: 71
[2023-07-06 21:59:25,182 INFO] Weighted corpora loaded so far:
* corpus_1: 70
[2023-07-06 21:59:25,187 INFO] Weighted corpora loaded so far:
* corpus_1: 72
[2023-07-06 21:59:25,190 INFO] Weighted corpora loaded so far:
* corpus_1: 71
[2023-07-06 21:59:25,193 INFO] Weighted corpora loaded so far:
* corpus_1: 73
[2023-07-06 21:59:25,196 INFO] Weighted corpora loaded so far:
* corpus_1: 72
[2023-07-06 21:59:25,199 INFO] Weighted corpora loaded so far:
* corpus_1: 74
[2023-07-06 21:59:25,202 INFO] Weighted corpora loaded so far:
* corpus_1: 73
[2023-07-06 21:59:25,207 INFO] Weighted corpora loaded so far:
* corpus_1: 75
[2023-07-06 21:59:25,208 INFO] Weighted corpora loaded so far:
* corpus_1: 74
[2023-07-06 21:59:25,212 INFO] Weighted corpora loaded so far:
* corpus_1: 76
[2023-07-06 21:59:25,215 INFO] Weighted corpora loaded so far:
* corpus_1: 75
[2023-07-06 21:59:25,218 INFO] Weighted corpora loaded so far:
* corpus_1: 77
[2023-07-06 21:59:25,222 INFO] Weighted corpora loaded so far:
* corpus_1: 76
[2023-07-06 21:59:25,224 INFO] Weighted corpora loaded so far:
* corpus_1: 78
[2023-07-06 21:59:25,227 INFO] Weighted corpora loaded so far:
* corpus_1: 77
[2023-07-06 21:59:25,233 INFO] Weighted corpora loaded so far:
* corpus_1: 78
[2023-07-06 21:59:25,319 INFO] Weighted corpora loaded so far:
* corpus_1: 79
[2023-07-06 21:59:25,324 INFO] Weighted corpora loaded so far:
* corpus_1: 80
[2023-07-06 21:59:25,328 INFO] Weighted corpora loaded so far:
* corpus_1: 79
[2023-07-06 21:59:25,330 INFO] Weighted corpora loaded so far:
* corpus_1: 81
[2023-07-06 21:59:25,334 INFO] Weighted corpora loaded so far:
* corpus_1: 80
[2023-07-06 21:59:25,336 INFO] Weighted corpora loaded so far:
* corpus_1: 82
[2023-07-06 21:59:25,340 INFO] Weighted corpora loaded so far:
* corpus_1: 81
[2023-07-06 21:59:25,343 INFO] Weighted corpora loaded so far:
* corpus_1: 83
[2023-07-06 21:59:25,347 INFO] Weighted corpora loaded so far:
* corpus_1: 82
[2023-07-06 21:59:25,349 INFO] Weighted corpora loaded so far:
* corpus_1: 84
[2023-07-06 21:59:25,353 INFO] Weighted corpora loaded so far:
* corpus_1: 83
[2023-07-06 21:59:25,355 INFO] Weighted corpora loaded so far:
* corpus_1: 85
[2023-07-06 21:59:25,359 INFO] Weighted corpora loaded so far:
* corpus_1: 84
[2023-07-06 21:59:25,362 INFO] Weighted corpora loaded so far:
* corpus_1: 86
[2023-07-06 21:59:25,365 INFO] Weighted corpora loaded so far:
* corpus_1: 85
[2023-07-06 21:59:25,368 INFO] Weighted corpora loaded so far:
* corpus_1: 87
and after a moment the training process just stops, without ever printing step logs like “step 400/3000…”.
Here is the start of the training process logs:
[2023-07-06 21:59:21,550 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2023-07-06 21:59:21,550 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2023-07-06 21:59:21,550 INFO] Missing transforms field for valid data, set to default: [].
[2023-07-06 21:59:21,550 INFO] Parsed 2 corpora from -data.
[2023-07-06 21:59:21,550 INFO] Loading checkpoint from models/model.fren_step_52000.pt
[2023-07-06 21:59:21,889 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-07-06 21:59:21,889 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2023-07-06 21:59:22,053 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '▁the', ',', '<s>', '</s>', '.', '▁of', '▁and', '▁in']
[2023-07-06 21:59:22,053 INFO] The decoder start token is: <s>
[2023-07-06 21:59:22,059 INFO] Building model...
[2023-07-06 21:59:22,601 INFO] Switching model to float32 for amp/apex_amp
[2023-07-06 21:59:22,601 INFO] Non quantized layer compute is fp16
[2023-07-06 21:59:22,601 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-07-06 21:59:22,738 INFO] src: 0 new tokens
[2023-07-06 21:59:23,029 INFO] tgt: 0 new tokens
[2023-07-06 21:59:23,369 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50000, 512, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=False)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50000, 512, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=False)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=512, out_features=50000, bias=True)
)
[2023-07-06 21:59:23,370 INFO] encoder: 44487680
[2023-07-06 21:59:23,370 INFO] decoder: 76435280
[2023-07-06 21:59:23,370 INFO] * number of parameters: 120922960
[2023-07-06 21:59:23,371 INFO] Trainable parameters = {'torch.float32': 120922960, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2023-07-06 21:59:23,371 INFO] Non trainable parameters = {'torch.float32': 0, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2023-07-06 21:59:23,371 INFO] * src vocab size = 50000
[2023-07-06 21:59:23,371 INFO] * tgt vocab size = 50000
[2023-07-06 21:59:23,378 INFO] Starting training on GPU: [0]
[2023-07-06 21:59:23,378 INFO] Start training loop and validate every 1000 steps...
[2023-07-06 21:59:23,378 INFO] Scoring with: TransformPipe()
[2023-07-06 21:59:24,046 INFO] Weighted corpora loaded so far:
* corpus_1: 1
[2023-07-06 21:59:24,686 INFO] Weighted corpora loaded so far:
* corpus_1: 1
[2023-07-06 21:59:24,692 INFO] Weighted corpora loaded so far:
* corpus_1: 2
And here is my configuration file:
save_data: opennmt
# Training files
data:
    corpus_1:
        path_src: temps/src-flores.subword
        path_tgt: temps/tgt-flores.subword
    valid:
        path_src: temps/src-flores-val.subword
        path_tgt: temps/tgt-flores-val.subword
train_from: "models/model.fren_step_52000.pt"
update_vocab: true
reset_optim: all
# Vocabulary files, generated by onmt_build_vocab
src_vocab: opennmt/source.vocab
tgt_vocab: opennmt/target.vocab
# Vocabulary size - should be the same as in sentence piece
src_vocab_size: 50000
tgt_vocab_size: 50000
# Tokenization options
src_subword_model: models/source.model
tgt_subword_model: models/target.model
# Where to save the log file and the output models/checkpoints
log_file: temps/train.log
save_model: models/model.fren
save_checkpoint_steps: 500
train_steps: 3000
valid_steps: 1000
# Train on a single GPU
world_size: 1
gpu_ranks: [0]
keep_checkpoint: 3
seed: 3435
# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 4000
report_every: 100
# Batching
bucket_size: 262144
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]
# Optimization
multiquery: true
add_ffnbias: false
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 2
dec_layers: 2
heads: 2
hidden_size: 256
word_vec_size: 256
transformer_ff: 64
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
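For completeness, the .subword files referenced in the config are produced offline by applying the SentencePiece models before training; roughly like this (a minimal sketch of my preprocessing, assuming plain SentencePiece; the input text file is the one from the pairing step above):

# Minimal sketch (assumption: plain SentencePiece applied offline to produce the .subword files).
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("models/source.model")  # same model as src_subword_model in the config

with open("temps/src-flores.txt", encoding="utf-8") as fin, \
     open("temps/src-flores.subword", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode_as_pieces(line.strip())  # split the sentence into subword pieces
        fout.write(" ".join(pieces) + "\n")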
The configuration file of my base model is exactly the same, except that it has no train_from: "models/model.fren_step_52000.pt", update_vocab: true, or reset_optim: all lines, and it points to different dataset files.
This only happens with small datasets; with bigger datasets I don’t have this issue.
My question is: is this normal? Shouldn’t there be training-step logs during a fine-tuning run?
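One back-of-the-envelope calculation while staring at the config, which may or may not be relevant (I’m not certain whether bucket_size counts examples or tokens):

# Assumption: bucket_size counts examples buffered before batching.
bucket_size = 262144      # from the config above
corpus_sentences = 2000   # size of the fine-tuning set
print(bucket_size / corpus_sentences)  # ~131 full passes over corpus_1 to fill a single bucket,
                                       # which would match the repeated "Weighted corpora loaded so far" lines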
I hope I’ll find the answer, but for now I can only thank you for reading this far.