Hello!
I have trained a Dual-source Transformer with two inputs; my configuration file looks like this:
# config.yaml
model_dir: models_6/

data:
  source_1_vocabulary: run/source.vocab
  source_2_vocabulary: run/source1.vocab
  target_vocabulary: run/target.vocab
  train_features_file:
    - file2_top2.txt-filtered.ru.subword.train.desubword.subword # subworded text
    - file2_top2.txt-filtered.ru.subword.train.desubword.pos # POS tags; each word-level tag is repeated for all of the word's subword units, so both files have the same number of tokens
  train_labels_file: file1_top2.txt-filtered.en.subword.train
  eval_features_file:
    - file2_top2.txt-filtered.ru.subword.dev.desubword.subword # dev file for inputter 1
    - file2_top2.txt-filtered.ru.subword.dev.desubword.pos # dev file for inputter 2
  eval_labels_file: file1_top2.txt-filtered.en.subword.dev
  src_vocab_size: 50000
  src_subword: target_probs_ENV_RU.model
  src1_vocab_size: 17
  src_seq_length: 150
  tgt_subword_model: source_probs_ENV_ENG.model
  tgt_vocab_size: 50000

eval:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5

infer:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5

params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998

score:
  batch_size: 64
  batch_type: examples
  length_bucket_width: 5

train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 2
  max_step: 100000
  maximum_features_length:
    - 100
    - 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100
  save_checkpoints_steps: 1000
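To produce the .pos feature files, each word-level POS tag is extended to all of the word's subword units. A minimal sketch of that step (assuming SentencePiece-style tokens where "▁" marks the start of a word; my actual preprocessing script may differ):

```python
# Replicate each word-level POS tag across the word's subword units,
# assuming SentencePiece-style tokens where "▁" starts a new word.
def expand_pos_to_subwords(subword_tokens, pos_tags):
    expanded = []
    word_idx = -1
    for tok in subword_tokens:
        if tok.startswith("▁"):  # a new word begins here
            word_idx += 1
        expanded.append(pos_tags[word_idx])
    return expanded

subwords = ["▁при", "вет", "▁мир"]  # 2 words split into 3 subword units
print(expand_pos_to_subwords(subwords, ["INTJ", "NOUN"]))
# ['INTJ', 'INTJ', 'NOUN']
```

This keeps the two feature files aligned one-to-one, token for token.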
I basically used the automatic configuration parameters, except for the checkpoint-saving and maximum-training-step settings.
Training does not throw any errors, but the loss stalls at around 1.5 very early on (around step 1,000), while the learning rate continues to increase. I have previously tried learning rates of 0.002, 0.001, 1, and 2. Here is an excerpt from the training log:
2024-05-06 14:35:18.903000: I training.py:176] Saved checkpoint models_6/ckpt-4000
2024-05-06 14:38:29.648000: I runner.py:310] Step = 4100 ; steps/s = 0.52, tokens/s = 9069 (9069 target) ; Learning rate = 0.000507 ; Loss = 1.466390
2024-05-06 14:41:39.976000: I runner.py:310] Step = 4200 ; steps/s = 0.53, tokens/s = 9080 (9080 target) ; Learning rate = 0.000519 ; Loss = 1.461054
2024-05-06 14:44:50.822000: I runner.py:310] Step = 4300 ; steps/s = 0.52, tokens/s = 9050 (9050 target) ; Learning rate = 0.000531 ; Loss = 1.453110
2024-05-06 14:48:01.771000: I runner.py:310] Step = 4400 ; steps/s = 0.52, tokens/s = 9071 (9071 target) ; Learning rate = 0.000544 ; Loss = 1.458882
2024-05-06 14:51:12.239000: I runner.py:310] Step = 4500 ; steps/s = 0.53, tokens/s = 9081 (9081 target) ; Learning rate = 0.000556 ; Loss = 1.486341
2024-05-06 14:54:23.148000: I runner.py:310] Step = 4600 ; steps/s = 0.52, tokens/s = 9037 (9037 target) ; Learning rate = 0.000568 ; Loss = 1.526749
2024-05-06 14:57:29.492000: I runner.py:310] Step = 4700 ; steps/s = 0.54, tokens/s = 9035 (9035 target) ; Learning rate = 0.000581 ; Loss = 1.464090
2024-05-06 15:00:39.649000: I runner.py:310] Step = 4800 ; steps/s = 0.53, tokens/s = 9091 (9091 target) ; Learning rate = 0.000593 ; Loss = 1.461399
2024-05-06 15:03:49.764000: I runner.py:310] Step = 4900 ; steps/s = 0.53, tokens/s = 9103 (9103 target) ; Learning rate = 0.000605 ; Loss = 1.475364
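For reference, the logged learning-rate values are consistent with the standard Noam schedule given my settings, so the schedule itself seems to be applied as configured. A quick sanity check (the `step + 1` offset is my assumption about how steps are counted):

```python
# Noam schedule check with learning_rate=2.0, model_dim=512, warmup_steps=8000,
# i.e. lr = scale * dim^-0.5 * min(step^-0.5, step * warmup^-1.5).
def noam_lr(step, scale=2.0, model_dim=512, warmup_steps=8000):
    s = step + 1  # assumed 1-based step counting
    return scale * model_dim ** -0.5 * min(s ** -0.5, s * warmup_steps ** -1.5)

print(round(noam_lr(4100), 6))  # 0.000507 (matches the log)
print(round(noam_lr(4500), 6))  # 0.000556 (matches the log)
print(round(noam_lr(8000), 6))  # 0.000988, roughly the peak at the end of warmup
```

So at step ~4,500 the schedule is still deep in warmup, and the rate keeps climbing until around step 8,000.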
I tried running inference with this model at the 4,000-step checkpoint, but the predictions are just sequences of <unk> tokens.
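I was wondering whether a vocabulary/tokenization mismatch could explain the <unk> output. This is the kind of check I was planning to run (a rough sketch; it assumes the vocabulary files contain one token per line):

```python
# Fraction of corpus tokens missing from a vocabulary file,
# assuming the vocabulary file lists one token per line.
def oov_rate(corpus_path, vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {line.rstrip("\n") for line in f}
    total = oov = 0
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for tok in line.split():
                total += 1
                oov += tok not in vocab
    return oov / max(total, 1)

# e.g. oov_rate("file1_top2.txt-filtered.en.subword.train", "run/target.vocab")
# should be close to 0 if the vocabulary matches the tokenization.
```

If the rate is high for the target side, that would fit the all-<unk> predictions, but I have not confirmed this yet.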
I am new to the field and would be very grateful for any suggestions about what might be causing this issue and what I could try in order to train the model successfully.
Many thanks in advance!