Dual-source Transformer produces only <unk> tags in inference

Hello!

I have trained a Dual-source Transformer with two inputs; my configuration file looks like this:

# config.yaml

model_dir: models_6/

data:
  source_1_vocabulary: run/source.vocab
  source_2_vocabulary: run/source1.vocab
  target_vocabulary: run/target.vocab
  train_features_file:
    - file2_top2.txt-filtered.ru.subword.train.desubword.subword # Subworded text
    - file2_top2.txt-filtered.ru.subword.train.desubword.pos  # POS tags; the POS tag of each word is extended to all of its subword units, so both files contain the same number of tokens
  train_labels_file: file1_top2.txt-filtered.en.subword.train
  eval_features_file:
    - file2_top2.txt-filtered.ru.subword.dev.desubword.subword # Dev file for inputter 1
    - file2_top2.txt-filtered.ru.subword.dev.desubword.pos # Dev file for inputter 2
  eval_labels_file: file1_top2.txt-filtered.en.subword.dev
        
src_vocab_size: 50000
src_subword: target_probs_ENV_RU.model
src1_vocab_size: 17
src_seq_length: 150
tgt_subword_model: source_probs_ENV_ENG.model
tgt_vocab_size: 50000

eval:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5
infer:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
  batch_type: examples
  length_bucket_width: 5
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 2
  max_step: 100000
  maximum_features_length:
  - 100
  - 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100
  save_checkpoints_steps: 1000
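In case the feature files matter here: this is a quick sanity check (a minimal sketch; the file paths in the commented usage are the ones from the config above) that the subword file and the POS file are aligned line by line and token by token:

```python
def check_alignment(subword_lines, pos_lines):
    """Return the 1-based indices of lines whose token counts differ."""
    assert len(subword_lines) == len(pos_lines), "files have different line counts"
    return [i + 1 for i, (s, p) in enumerate(zip(subword_lines, pos_lines))
            if len(s.split()) != len(p.split())]

# Usage on the actual training files from the config:
# with open("file2_top2.txt-filtered.ru.subword.train.desubword.subword") as f1, \
#      open("file2_top2.txt-filtered.ru.subword.train.desubword.pos") as f2:
#     print(check_alignment(f1.readlines(), f2.readlines()) or "all lines aligned")
```

Running this on my training and dev files reports no mismatched lines.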

I mostly kept the automatic configuration parameters, changing only the checkpoint-saving settings and the maximum number of training steps.

It does not throw any errors, but the loss stalls at around 1.5 very early in training (around step 1,000) while the learning rate is still increasing. I have previously tried learning rates of 0.002, 0.001, 1, and 2. Here is an example of the training log:

2024-05-06 14:35:18.903000: I training.py:176] Saved checkpoint models_6/ckpt-4000
2024-05-06 14:38:29.648000: I runner.py:310] Step = 4100 ; steps/s = 0.52, tokens/s = 9069 (9069 target) ; Learning rate = 0.000507 ; Loss = 1.466390
2024-05-06 14:41:39.976000: I runner.py:310] Step = 4200 ; steps/s = 0.53, tokens/s = 9080 (9080 target) ; Learning rate = 0.000519 ; Loss = 1.461054
2024-05-06 14:44:50.822000: I runner.py:310] Step = 4300 ; steps/s = 0.52, tokens/s = 9050 (9050 target) ; Learning rate = 0.000531 ; Loss = 1.453110
2024-05-06 14:48:01.771000: I runner.py:310] Step = 4400 ; steps/s = 0.52, tokens/s = 9071 (9071 target) ; Learning rate = 0.000544 ; Loss = 1.458882
2024-05-06 14:51:12.239000: I runner.py:310] Step = 4500 ; steps/s = 0.53, tokens/s = 9081 (9081 target) ; Learning rate = 0.000556 ; Loss = 1.486341
2024-05-06 14:54:23.148000: I runner.py:310] Step = 4600 ; steps/s = 0.52, tokens/s = 9037 (9037 target) ; Learning rate = 0.000568 ; Loss = 1.526749
2024-05-06 14:57:29.492000: I runner.py:310] Step = 4700 ; steps/s = 0.54, tokens/s = 9035 (9035 target) ; Learning rate = 0.000581 ; Loss = 1.464090
2024-05-06 15:00:39.649000: I runner.py:310] Step = 4800 ; steps/s = 0.53, tokens/s = 9091 (9091 target) ; Learning rate = 0.000593 ; Loss = 1.461399
2024-05-06 15:03:49.764000: I runner.py:310] Step = 4900 ; steps/s = 0.53, tokens/s = 9103 (9103 target) ; Learning rate = 0.000605 ; Loss = 1.475364

I tried running inference with this model after 4,000 training steps, but the predictions are just sequences of <unk> tags.
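In case the all-<unk> output points to a vocabulary mismatch, I plan to verify coverage with something like this sketch (assuming one token per line in run/target.vocab; very low coverage would explain the predictions):

```python
def vocab_coverage(vocab_tokens, text_lines):
    """Fraction of tokens in text_lines that are present in vocab_tokens."""
    vocab = set(vocab_tokens)
    tokens = [t for line in text_lines for t in line.split()]
    return sum(t in vocab for t in tokens) / max(len(tokens), 1)

# Usage with the files from the config:
# with open("run/target.vocab") as v, \
#      open("file1_top2.txt-filtered.en.subword.train") as t:
#     print(vocab_coverage(v.read().split("\n"), t.readlines()))
```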

I am new to the field and would be very grateful for any suggestions about what might be causing this and what I could try in order to train the model successfully.

Many thanks in advance!