I am new to OpenNMT. I am trying to train a model using a single parallel corpus (~100K, i.e. 1 lakh, sentence pairs) and one validation corpus (1K sentence pairs). When training starts, the 'Weighted corpora loaded so far: * corpus_1: <value>' line is printed many times, and the value keeps increasing. Here is a piece of the logs:
[2024-04-28 13:04:48,280 INFO] Weighted corpora loaded so far:
* corpus_1: 2
[2024-04-28 13:04:48,479 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,495 INFO] * Transform statistics for corpus_1(3.12%):
* FilterTooLongStats(filtered=1)
[2024-04-28 13:04:48,496 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,518 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,569 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,575 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,595 INFO] * Transform statistics for corpus_1(3.12%):
* FilterTooLongStats(filtered=1)
Could you please help me understand what is going on here? What does 'corpus_1: 2' or 'corpus_1: 3' mean?
The relevant contents of the config file are as follows:
data:
    corpus_1:
        path_src: data/bpe-train.ko
        path_tgt: data/bpe-train.hi
        transforms: [filtertoolong]
        weight: 1
    valid:
        path_src: data/bpe-dev.ko
        path_tgt: data/bpe-dev.hi
        transforms: [filtertoolong]
# General opts
log_file: logs.txt
save_model: checkpoints/model
tensorboard_log_dir: "tensorboard"
tensorboard: true
keep_checkpoint: 20
save_checkpoint_steps: 10000
average_decay: 0.0005
seed: 1234
report_every: 1000
train_steps: 400000
valid_steps: 100
# Batching
bucket_size: 144
num_workers: 32
batch_size: 8192
batch_type: "tokens"
normalization: "tokens"
dropout: 0.1
label_smoothing: 0.1
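In case it matters, I am launching training with the standard OpenNMT-py commands (config.yaml below is just a placeholder for my actual config file name; the vocab-related options are in the same config but omitted above):

onmt_build_vocab -config config.yaml -n_sample -1
onmt_train -config config.yaml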
Thank you!