Is it normal to see "Weighted corpora loaded so far" in a loop during the finetuning phase on a very small dataset?

Hello, today I’m trying to finetune a tiny transformer translation model on a small dataset of high-quality parallel sentences taken from FLORES-200 (I know this dataset is intended for evaluation, but I need high-quality parallel sentences, and this model will not be used in production).

I have already trained the model on 200k sentences from Wikipedia, and now I’m trying to finetune it on 2k sentences, but I get these logs in a loop:

[2023-07-06 21:59:25,144 INFO] Weighted corpora loaded so far:
                        * corpus_1: 65
[2023-07-06 21:59:25,145 INFO] Weighted corpora loaded so far:
                        * corpus_1: 64
[2023-07-06 21:59:25,149 INFO] Weighted corpora loaded so far:
                        * corpus_1: 66
[2023-07-06 21:59:25,151 INFO] Weighted corpora loaded so far:
                        * corpus_1: 65
[2023-07-06 21:59:25,155 INFO] Weighted corpora loaded so far:
                        * corpus_1: 67
[2023-07-06 21:59:25,157 INFO] Weighted corpora loaded so far:
                        * corpus_1: 66
[2023-07-06 21:59:25,163 INFO] Weighted corpora loaded so far:
                        * corpus_1: 68
[2023-07-06 21:59:25,163 INFO] Weighted corpora loaded so far:
                        * corpus_1: 67
[2023-07-06 21:59:25,169 INFO] Weighted corpora loaded so far:
                        * corpus_1: 69
[2023-07-06 21:59:25,171 INFO] Weighted corpora loaded so far:
                        * corpus_1: 68
[2023-07-06 21:59:25,174 INFO] Weighted corpora loaded so far:
                        * corpus_1: 70
[2023-07-06 21:59:25,177 INFO] Weighted corpora loaded so far:
                        * corpus_1: 69
[2023-07-06 21:59:25,180 INFO] Weighted corpora loaded so far:
                        * corpus_1: 71
[2023-07-06 21:59:25,182 INFO] Weighted corpora loaded so far:
                        * corpus_1: 70
[2023-07-06 21:59:25,187 INFO] Weighted corpora loaded so far:
                        * corpus_1: 72
[2023-07-06 21:59:25,190 INFO] Weighted corpora loaded so far:
                        * corpus_1: 71
[2023-07-06 21:59:25,193 INFO] Weighted corpora loaded so far:
                        * corpus_1: 73
[2023-07-06 21:59:25,196 INFO] Weighted corpora loaded so far:
                        * corpus_1: 72
[2023-07-06 21:59:25,199 INFO] Weighted corpora loaded so far:
                        * corpus_1: 74
[2023-07-06 21:59:25,202 INFO] Weighted corpora loaded so far:
                        * corpus_1: 73
[2023-07-06 21:59:25,207 INFO] Weighted corpora loaded so far:
                        * corpus_1: 75
[2023-07-06 21:59:25,208 INFO] Weighted corpora loaded so far:
                        * corpus_1: 74
[2023-07-06 21:59:25,212 INFO] Weighted corpora loaded so far:
                        * corpus_1: 76
[2023-07-06 21:59:25,215 INFO] Weighted corpora loaded so far:
                        * corpus_1: 75
[2023-07-06 21:59:25,218 INFO] Weighted corpora loaded so far:
                        * corpus_1: 77
[2023-07-06 21:59:25,222 INFO] Weighted corpora loaded so far:
                        * corpus_1: 76
[2023-07-06 21:59:25,224 INFO] Weighted corpora loaded so far:
                        * corpus_1: 78
[2023-07-06 21:59:25,227 INFO] Weighted corpora loaded so far:
                        * corpus_1: 77
[2023-07-06 21:59:25,233 INFO] Weighted corpora loaded so far:
                        * corpus_1: 78
[2023-07-06 21:59:25,319 INFO] Weighted corpora loaded so far:
                        * corpus_1: 79
[2023-07-06 21:59:25,324 INFO] Weighted corpora loaded so far:
                        * corpus_1: 80
[2023-07-06 21:59:25,328 INFO] Weighted corpora loaded so far:
                        * corpus_1: 79
[2023-07-06 21:59:25,330 INFO] Weighted corpora loaded so far:
                        * corpus_1: 81
[2023-07-06 21:59:25,334 INFO] Weighted corpora loaded so far:
                        * corpus_1: 80
[2023-07-06 21:59:25,336 INFO] Weighted corpora loaded so far:
                        * corpus_1: 82
[2023-07-06 21:59:25,340 INFO] Weighted corpora loaded so far:
                        * corpus_1: 81
[2023-07-06 21:59:25,343 INFO] Weighted corpora loaded so far:
                        * corpus_1: 83
[2023-07-06 21:59:25,347 INFO] Weighted corpora loaded so far:
                        * corpus_1: 82
[2023-07-06 21:59:25,349 INFO] Weighted corpora loaded so far:
                        * corpus_1: 84
[2023-07-06 21:59:25,353 INFO] Weighted corpora loaded so far:
                        * corpus_1: 83
[2023-07-06 21:59:25,355 INFO] Weighted corpora loaded so far:
                        * corpus_1: 85
[2023-07-06 21:59:25,359 INFO] Weighted corpora loaded so far:
                        * corpus_1: 84
[2023-07-06 21:59:25,362 INFO] Weighted corpora loaded so far:
                        * corpus_1: 86
[2023-07-06 21:59:25,365 INFO] Weighted corpora loaded so far:
                        * corpus_1: 85
[2023-07-06 21:59:25,368 INFO] Weighted corpora loaded so far:
                        * corpus_1: 87

After a moment the training process just stops, without ever printing training logs like “step 400/3000…”.

Here is the start of the training logs:

[2023-07-06 21:59:21,550 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2023-07-06 21:59:21,550 WARNING] Corpus corpus_1's weight should be given. We default it to 1 for you.
[2023-07-06 21:59:21,550 INFO] Missing transforms field for valid data, set to default: [].
[2023-07-06 21:59:21,550 INFO] Parsed 2 corpora from -data.
[2023-07-06 21:59:21,550 INFO] Loading checkpoint from models/model.fren_step_52000.pt
[2023-07-06 21:59:21,889 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-07-06 21:59:21,889 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2023-07-06 21:59:22,053 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '▁the', ',', '<s>', '</s>', '.', '▁of', '▁and', '▁in']
[2023-07-06 21:59:22,053 INFO] The decoder start token is: <s>
[2023-07-06 21:59:22,059 INFO] Building model...
[2023-07-06 21:59:22,601 INFO] Switching model to float32 for amp/apex_amp
[2023-07-06 21:59:22,601 INFO] Non quantized layer compute is fp16
[2023-07-06 21:59:22,601 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-07-06 21:59:22,738 INFO] src: 0 new tokens
[2023-07-06 21:59:23,029 INFO] tgt: 0 new tokens
[2023-07-06 21:59:23,369 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50000, 512, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=False)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(50000, 512, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=False)
          (w_2): Linear(in_features=2048, out_features=512, bias=False)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=512, out_features=50000, bias=True)
)
[2023-07-06 21:59:23,370 INFO] encoder: 44487680
[2023-07-06 21:59:23,370 INFO] decoder: 76435280
[2023-07-06 21:59:23,370 INFO] * number of parameters: 120922960
[2023-07-06 21:59:23,371 INFO] Trainable parameters = {'torch.float32': 120922960, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2023-07-06 21:59:23,371 INFO] Non trainable parameters = {'torch.float32': 0, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2023-07-06 21:59:23,371 INFO]  * src vocab size = 50000
[2023-07-06 21:59:23,371 INFO]  * tgt vocab size = 50000
[2023-07-06 21:59:23,378 INFO] Starting training on GPU: [0]
[2023-07-06 21:59:23,378 INFO] Start training loop and validate every 1000 steps...
[2023-07-06 21:59:23,378 INFO] Scoring with: TransformPipe()
[2023-07-06 21:59:24,046 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2023-07-06 21:59:24,686 INFO] Weighted corpora loaded so far:
                        * corpus_1: 1
[2023-07-06 21:59:24,692 INFO] Weighted corpora loaded so far:
                        * corpus_1: 2

And here is my configuration file:

save_data: opennmt

# Training files
data:
    corpus_1:
        path_src: temps/src-flores.subword
        path_tgt: temps/tgt-flores.subword
    valid:
        path_src: temps/src-flores-val.subword
        path_tgt: temps/tgt-flores-val.subword
 
train_from: "models/model.fren_step_52000.pt"
update_vocab: true
reset_optim: all

# Vocabulary files, generated by onmt_build_vocab
src_vocab: opennmt/source.vocab
tgt_vocab: opennmt/target.vocab

# Vocabulary size - should be the same as in sentence piece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Tokenization options
src_subword_model: models/source.model
tgt_subword_model: models/target.model

# Where to save the log file and the output models/checkpoints
log_file: temps/train.log
save_model: models/model.fren

save_checkpoint_steps: 500
train_steps: 3000
valid_steps: 1000

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

keep_checkpoint: 3

seed: 3435

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 4000
report_every: 100

# Batching
bucket_size: 262144
batch_type: "tokens"
batch_size: 4096  
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
multiquery: true
add_ffnbias: false

model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 2
dec_layers: 2
heads: 2
hidden_size: 256
word_vec_size: 256
transformer_ff: 64
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]

The configuration file of my base model is exactly the same, except that it does not have the train_from: "models/model.fren_step_52000.pt", update_vocab: true, and reset_optim: all options, and it points to different dataset files.

This only happens on small datasets; I don’t have this issue on bigger ones.

My question is: is this normal? Shouldn’t there be training-step logs during finetuning?

I hope I’ll find the answer, but for now I can only thank you for reading this far.

The issue is probably just the size of the dataset. You have a batch size of 4,096 tokens and a bucket size of 262,144. All that repeated loading is because OpenNMT-py is trying to fill a bucket of 262,144 tokens, which requires loading FLORES-200 many times since the dataset is so small.
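For a corpus of about 2k sentences, something along these lines in the batching section should avoid the repeated reloading (the bucket_size value below is just an illustrative guess for your corpus size, not an official recommendation):

# Batching - sketch for a ~2k-sentence finetuning corpus
# bucket_size of roughly the corpus size is an assumption, tune as needed
bucket_size: 2048
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 2048
accum_count: [4]
accum_steps: [0]

With the bucket roughly the size of the corpus, a single pass over the data is enough to fill it, so the stream of “Weighted corpora loaded so far” messages should mostly disappear.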


Oh, you are right! I just reduced the bucket size to 144 and it works, thank you a lot!