Hi everyone,
I’m having trouble using source features in the v3 inline format (W1│F1 W2│F2 … Wn│Fn) while re-running some old experiments; previously the features lived in a second, parallel file. Below is the config .yaml file I use and the command I run to fine-tune my model, followed by the error I receive.
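For clarity, here is a quick snippet showing what one source line looks like in that inline format (the Spanish tokens and POS tags are made up for illustration; the separator is "│", U+2502):

```python
# Build one source line in the v3 inline-feature format:
# each token is glued to its feature with the "│" separator.
tokens = ["estamos", "aquí"]   # hypothetical tokens
pos_tags = ["VERB", "ADV"]     # one POS feature per token
line = " ".join(f"{tok}│{feat}" for tok, feat in zip(tokens, pos_tags))
print(line)  # estamos│VERB aquí│ADV
```

My eslse_train_es_pos_tagged.txt file follows this shape, one sentence per line.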
For context, when I run this without attempting to use features, everything functions as expected and my models continue to train.
Thank you!
My config file (eslse_pos_8400_run01.yaml):

# Finetune best model from run 1
## Where the samples will be written
save_data: tg-finetune
## Where the vocab(s) will be written
src_vocab: tg-finetune/tg-finetune.vocab.src
tgt_vocab: tg-finetune/tg-finetune.vocab.tgt
# Allow overwriting existing files in the folder
overwrite: True
#src_feats: None
# Corpus opts:
data:
    corpus_1:
        path_src: eslse_train_es_pos_tagged.txt
        path_tgt: eslse_train_gloss.txt
        transforms: [inferfeats]
        weight: 1
    valid:
        path_src: eslse_dev_es_tok.txt
        path_tgt: eslse_dev_gloss.txt
        transforms: [inferfeats]
# Train on a single GPU
world_size: 1
gpu_ranks: [0]
# Where to save the checkpoints - for finetuning must specify # of steps from final checkpoint!
save_model: tg-finetune/tg-finetune_8400_01
save_checkpoint_steps: 200
train_steps: 13400
valid_steps: 200
# Transform options
reversible_tokenization: "joiner"
# Features options
n_src_feats: 1
src_feats_defaults: "X"
feat_merge: "concat"
The command I run:
python3 ../OpenNMT-py/train.py --config eslse_pos_8400_run01.yaml --train_from tg-pretrain/models/tg-pretrain_03_step_8400.pt --reset_optim keep_states --log_file tg-features/eslse_ft_tat_8400_run01.log
The error I receive:
[2024-01-23 16:31:55,491 INFO] Weighted corpora loaded so far:
* corpus_1: 276
Traceback (most recent call last):
  File "/home/ubuntu/lse_exps/../OpenNMT-py/train.py", line 6, in <module>
    main()
  File "/home/ubuntu/OpenNMT-py/onmt/bin/train.py", line 67, in main
    train(opt)
  File "/home/ubuntu/OpenNMT-py/onmt/bin/train.py", line 52, in train
    train_process(opt, device_id=0)
  File "/home/ubuntu/OpenNMT-py/onmt/train_single.py", line 238, in main
    trainer.train(
  File "/home/ubuntu/OpenNMT-py/onmt/trainer.py", line 308, in train
    for i, (batches, normalization) in enumerate(self._accum_batches(train_iter)):
  File "/home/ubuntu/OpenNMT-py/onmt/trainer.py", line 238, in _accum_batches
    for batch, bucket_idx in iterator:
  File "/home/ubuntu/OpenNMT-py/onmt/inputters/dynamic_iterator.py", line 373, in __iter__
    for (tensor_batch, bucket_idx) in self.data_iter:
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/home/ubuntu/OpenNMT-py/onmt/inputters/dynamic_iterator.py", line 341, in __iter__
    for bucket, bucket_idx in self._bucketing():
  File "/home/ubuntu/OpenNMT-py/onmt/inputters/dynamic_iterator.py", line 278, in _bucketing
    yield (self._tuple_to_json_with_tokIDs(bucket), self.bucket_idx)
  File "/home/ubuntu/OpenNMT-py/onmt/inputters/dynamic_iterator.py", line 252, in _tuple_to_json_with_tokIDs
    bucket.append(numericalize(self.vocabs, example))
  File "/home/ubuntu/OpenNMT-py/onmt/inputters/text_utils.py", line 149, in numericalize
    for fv, feat in zip(vocabs["src_feats"], example["src"]["feats"]):
KeyError: 'src_feats'
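In case it helps narrow things down: the final frame boils down to the vocabs dict simply having no "src_feats" entry when the example does carry features. A minimal pure-Python sketch of that failure mode (the dict contents here are made up, not the real OpenNMT structures):

```python
# Sketch of the failing line in text_utils.numericalize: the vocabs
# mapping has no "src_feats" key, so the lookup raises KeyError before
# any feature is numericalized.
vocabs = {"src": ["estamos", "aquí"], "tgt": ["GLOSS"]}  # no "src_feats" key
example = {"src": {"feats": [["VERB", "ADV"]]}}          # shape is my assumption

caught = None
try:
    for fv, feat in zip(vocabs["src_feats"], example["src"]["feats"]):
        pass
except KeyError as err:
    caught = err
print(caught)  # 'src_feats'
```

My guess is that the features vocab was never built or loaded, perhaps because the pretrained checkpoint I pass to --train_from was created without n_src_feats, but I’m not sure how to fix that.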
Thanks in advance to anyone who can offer me some advice!