I am new to OpenNMT. I am trying to train a model using a single parallel corpus (~100K, i.e. 1 lakh, sentence pairs) and one validation corpus (1K sentence pairs). When training starts, the 'Weighted corpora loaded so far: * corpus_1: <value>' line is printed many times, and the value keeps increasing. Here is a piece of the logs:
[2024-04-28 13:04:48,280 INFO] Weighted corpora loaded so far:
* corpus_1: 2
[2024-04-28 13:04:48,479 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,495 INFO] * Transform statistics for corpus_1(3.12%):
* FilterTooLongStats(filtered=1)
[2024-04-28 13:04:48,496 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,518 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,569 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,575 INFO] Weighted corpora loaded so far:
* corpus_1: 3
[2024-04-28 13:04:48,595 INFO] * Transform statistics for corpus_1(3.12%):
* FilterTooLongStats(filtered=1)
Could you please help me understand what is going on here? What does 'corpus_1: 2' or 'corpus_1: 3' mean?
The relevant contents of the config file are as follows:
data:
    corpus_1:
        path_src: data/bpe-train.ko
        path_tgt: data/bpe-train.hi
        transforms: [filtertoolong]
        weight: 1
    valid:
        path_src: data/bpe-dev.ko
        path_tgt: data/bpe-dev.hi
        transforms: [filtertoolong]
# General opts
log_file: logs.txt
save_model: checkpoints/model
tensorboard_log_dir: "tensorboard"
tensorboard: true
keep_checkpoint: 20
save_checkpoint_steps: 10000
average_decay: 0.0005
seed: 1234
report_every: 1000
train_steps: 400000
valid_steps: 100
# Batching
bucket_size: 144
num_workers: 32
batch_size: 8192
batch_type: "tokens"
normalization: "tokens"
dropout: 0.1
label_smoothing: 0.1
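In case it matters, I am launching training with the standard OpenNMT-py commands (config.yaml below is just a placeholder for my actual config file name; the vocab-related options are in the same config but omitted above):

onmt_build_vocab -config config.yaml -n_sample -1
onmt_train -config config.yaml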
Thank you!