OpenNMT Forum

Indexed vocabularies issue

Hello,
I want to use parallel inputs for training, so I configured indexed vocabularies following
https://opennmt.net/OpenNMT-tf/vocabulary.html.
I am using OpenNMT-tf version 2.8.

But I get this error:
> `Traceback (most recent call last):
  File "/usr/local/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/opennmt/bin/main.py", line 204, in main
    checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/runner.py", line 147, in train
    checkpoint, config = self._init_run(num_devices=num_devices, training=True)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/runner.py", line 134, in _init_run
    return self._init_model(config), config
  File "/usr/local/lib/python3.6/dist-packages/opennmt/runner.py", line 120, in _init_model
    model.initialize(config["data"], params=config["params"])
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/sequence_to_sequence.py", line 127, in initialize
    super(SequenceToSequence, self).initialize(data_config, params=params)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/model.py", line 86, in initialize
    self.examples_inputter.initialize(data_config)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/sequence_to_sequence.py", line 426, in initialize
    super(SequenceToSequenceInputter, self).initialize(data_config, asset_prefix=asset_prefix)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/inputters/inputter.py", line 209, in initialize
    data_config, asset_prefix=_get_asset_prefix(asset_prefix, inputter, i))
  File "/usr/local/lib/python3.6/dist-packages/opennmt/inputters/text_inputter.py", line 381, in initialize
    super(WordEmbedder, self).initialize(data_config, asset_prefix=asset_prefix)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/inputters/text_inputter.py", line 254, in initialize
    data_config, "vocabulary", prefix=asset_prefix, required=True)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/inputters/text_inputter.py", line 208, in _get_field
    raise ValueError("Missing field '%s' in the data configuration" % key)
ValueError: Missing field 'source_vocabulary' in the data configuration
`

**This is my YAML file:**
model_dir: Deen_transformer
gpu_allow_growth: true
data:
  train_features_file:
    - data/tok/WMT-News.de-en.de.tok
    - data/tok/QED.de-en.de.tok
    - data/tok/Tatoeba.de-en.de.tok
    - data/tok/TED2013.de-en.de.tok
    - data/tok/TildeMODEL.de-en.de.tok
    - data/tok/Wikipedia.de-en.de.tok
    - data/tok/EUbookshop.de-en.de.tok
  train_labels_file:
    - data/tok/WMT-News.de-en.en.tok
    - data/tok/QED.de-en.en.tok
    - data/tok/Tatoeba.de-en.en.tok
    - data/tok/TED2013.de-en.en.tok
    - data/tok/TildeMODEL.de-en.en.tok
    - data/tok/Wikipedia.de-en.en.tok
    - data/tok/EUbookshop.de-en.en.tok
  eval_features_file: data/tok/test.de.tok
  eval_labels_file: data/tok/test.en.tok
  source_1_vocabulary: data/vocab/WMT-News.de-en.de.vocab
  source_2_vocabulary: data/vocab/QED.de-en.de.vocab
  source_3_vocabulary: data/vocab/Tatoeba.de-en.de.vocab
  source_4_vocabulary: data/vocab/TED2013.de-en.de.vocab
  source_5_vocabulary: data/vocab/TildeMODEL.de-en.de.vocab
  source_6_vocabulary: data/vocab/Wikipedia.de-en.de.vocab
  source_7_vocabulary: data/vocab/EUbookshop.de-en.de.vocab
  target_1_vocabulary: data/vocab/WMT-News.de-en.de.en.vocab
  target_2_vocabulary: data/vocab/QED.de-en.en.vocab
  target_3_vocabulary: data/vocab/Tatoeba.de-en.en.vocab
  target_4_vocabulary: data/vocab/TED2013.de-en.en.vocab
  target_5_vocabulary: data/vocab/TildeMODEL.de-en.en.vocab
  target_6_vocabulary: data/vocab/Wikipedia.de-en.en.vocab
  target_7_vocabulary: data/vocab/EUbookshop.de-en.en.vocab

thanks

Hi,

What is your model definition?

Hi,
I use the Transformer:

onmt-main --model_type Transformer --config config/GMT_deen.yml --auto_config train --with_eval

A multi-feature Transformer is not the same architecture as the default Transformer. You should provide a custom model definition to at least configure the embedding dimension of each feature and how they are merged.

See for example this model which defines 3 input features that are concatenated:

My mistake, sorry…
Thank you for the quick response.

Also note that target features are not supported.

Thank you.
In this case, should I build a single target vocabulary file from all the files listed under train_labels_file?

You should only build the vocabulary for the actual target file.

Now that I read your YAML configuration file again, are those training files actually parallel input features? From the names, it looks like they are unrelated training files (WMT, TED, etc.).

Actually, I think I misunderstood multi-feature.
I want to use parallel inputs and weight them for training, based on the documentation at https://opennmt.net/OpenNMT-tf/data.html#parallel-inputs
The documentation says:

Parallel inputs require indexed vocabularies
https://opennmt.net/OpenNMT-tf/vocabulary.html#configuring-vocabularies

“parallel” means “aligned” in this context. Are your files actually aligned?
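One quick sanity check, as a minimal sketch: aligned files must at least have the same number of lines, since line N of each file describes the same example. The helper names below (`count_lines`, `are_aligned`) are made up for illustration and are not part of OpenNMT-tf. Equal line counts are necessary but not sufficient, so a passing check does not prove the files are aligned.

```python
def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def are_aligned(paths):
    """Return True if every file in `paths` has the same number of lines.

    This is only a necessary condition for alignment: files with equal
    line counts can still contain unrelated sentences.
    """
    counts = {count_lines(p) for p in paths}
    return len(counts) <= 1
```

For example, calling `are_aligned` on the seven `.de.tok` files from the configuration above would almost certainly return `False` if they are independent corpora of different sizes.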

If not, maybe you are looking for weighted inputs? That feature only interleaves data coming from multiple datasets; it does not require a different model architecture or multiple vocabularies.
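To illustrate the idea of interleaving, here is a conceptual sketch in plain Python. It is not how OpenNMT-tf implements it, and the function name `weighted_interleave` and the toy data are assumptions for illustration only. Each step picks a dataset with probability proportional to its weight and takes the next example from it, so every example still passes through the same model and the same vocabulary.

```python
import itertools
import random


def weighted_interleave(datasets, weights, num_examples, seed=0):
    """Draw `num_examples` items by sampling datasets proportionally to `weights`.

    `datasets` is a list of non-empty lists of examples; each dataset is
    repeated when exhausted. This is an illustrative sketch only.
    """
    if len(datasets) != len(weights):
        raise ValueError("Expected one weight per dataset")
    if any(len(d) == 0 for d in datasets):
        raise ValueError("Datasets must not be empty")
    rng = random.Random(seed)
    # Cycle each dataset so that small corpora can be sampled repeatedly.
    iterators = [itertools.cycle(d) for d in datasets]
    indices = range(len(datasets))
    examples = []
    for _ in range(num_examples):
        # Pick which dataset the next example comes from.
        idx = rng.choices(indices, weights=weights, k=1)[0]
        examples.append(next(iterators[idx]))
    return examples


# Hypothetical toy corpora: (source, target) pairs from two domains.
news = [("news de 1", "news en 1"), ("news de 2", "news en 2")]
talks = [("talk de 1", "talk en 1")]
mixed = weighted_interleave([news, talks], weights=[0.8, 0.2], num_examples=10)
```

In the configuration itself, I believe the data documentation exposes this through a `train_files_weights` list under `data`, with one weight per training file, but please check the documentation for your installed version before relying on that key.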

That’s great.
The word “parallel” just confused me. :sweat_smile:
Thank you for your help.