Indexed vocabularies issue

jalesiyan-hadis · March 3, 2020, 10:20am

Hello,
I want to use parallel inputs for training, so I set up indexed vocabularies following
https://opennmt.net/OpenNMT-tf/vocabulary.html.
I am using OpenNMT-tf version 2.8.

But I get this error:
> `Traceback (most recent call last):

  File "/usr/local/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/opennmt/bin/main.py", line 204, in main
    checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/runner.py", line 147, in train
    checkpoint, config = self._init_run(num_devices=num_devices, training=True)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/runner.py", line 134, in _init_run
    return self._init_model(config), config
  File "/usr/local/lib/python3.6/dist-packages/opennmt/runner.py", line 120, in _init_model
    model.initialize(config["data"], params=config["params"])
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/sequence_to_sequence.py", line 127, in initialize
    super(SequenceToSequence, self).initialize(data_config, params=params)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/model.py", line 86, in initialize
    self.examples_inputter.initialize(data_config)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/models/sequence_to_sequence.py", line 426, in initialize
    super(SequenceToSequenceInputter, self).initialize(data_config, asset_prefix=asset_prefix)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/inputters/inputter.py", line 209, in initialize
    data_config, asset_prefix=_get_asset_prefix(asset_prefix, inputter, i))
  File "/usr/local/lib/python3.6/dist-packages/opennmt/inputters/text_inputter.py", line 381, in initialize
    super(WordEmbedder, self).initialize(data_config, asset_prefix=asset_prefix)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/inputters/text_inputter.py", line 254, in initialize
    data_config, "vocabulary", prefix=asset_prefix, required=True)
  File "/usr/local/lib/python3.6/dist-packages/opennmt/inputters/text_inputter.py", line 208, in _get_field
    raise ValueError("Missing field '%s' in the data configuration" % key)
ValueError: Missing field 'source_vocabulary' in the data configuration
`

**This is my YAML file:**
model_dir: Deen_transformer
gpu_allow_growth: true
data:
  train_features_file:
    - data/tok/WMT-News.de-en.de.tok
    - data/tok/QED.de-en.de.tok
    - data/tok/Tatoeba.de-en.de.tok
    - data/tok/TED2013.de-en.de.tok
    - data/tok/TildeMODEL.de-en.de.tok
    - data/tok/Wikipedia.de-en.de.tok
    - data/tok/EUbookshop.de-en.de.tok
  train_labels_file:
    - data/tok/WMT-News.de-en.en.tok
    - data/tok/QED.de-en.en.tok
    - data/tok/Tatoeba.de-en.en.tok
    - data/tok/TED2013.de-en.en.tok
    - data/tok/TildeMODEL.de-en.en.tok
    - data/tok/Wikipedia.de-en.en.tok
    - data/tok/EUbookshop.de-en.en.tok
  eval_features_file: data/tok/test.de.tok
  eval_labels_file: data/tok/test.en.tok
  source_1_vocabulary: data/vocab/WMT-News.de-en.de.vocab
  source_2_vocabulary: data/vocab/QED.de-en.de.vocab
  source_3_vocabulary: data/vocab/Tatoeba.de-en.de.vocab
  source_4_vocabulary: data/vocab/TED2013.de-en.de.vocab
  source_5_vocabulary: data/vocab/TildeMODEL.de-en.de.vocab
  source_6_vocabulary: data/vocab/Wikipedia.de-en.de.vocab
  source_7_vocabulary: data/vocab/EUbookshop.de-en.de.vocab
  target_1_vocabulary: data/vocab/WMT-News.de-en.de.en.vocab
  target_2_vocabulary: data/vocab/QED.de-en.en.vocab
  target_3_vocabulary: data/vocab/Tatoeba.de-en.en.vocab
  target_4_vocabulary: data/vocab/TED2013.de-en.en.vocab
  target_5_vocabulary: data/vocab/TildeMODEL.de-en.en.vocab
  target_6_vocabulary: data/vocab/Wikipedia.de-en.en.vocab
  target_7_vocabulary: data/vocab/EUbookshop.de-en.en.vocab

thanks

guillaumekln · March 3, 2020, 10:21am

Hi,

What is your model definition?

jalesiyan-hadis · March 3, 2020, 10:22am

Hi,
I use the Transformer model:

onmt-main --model_type Transformer --config config/GMT_deen.yml --auto_config train --with_eval

guillaumekln · March 3, 2020, 10:25am

A multi-feature Transformer is not the same architecture as the default Transformer. You should provide a custom model definition to at least configure the embedding dimension of each feature and how they are merged.
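
The error in the traceback follows from how each inputter looks up its vocabulary. Below is a minimal sketch, reconstructed from the `_get_field` call and the error message shown in the traceback rather than copied from the library. The function `get_field` is a simplified stand-in, and the shortened configuration is only for illustration.

```python
def get_field(data_config, key, prefix=None, required=False):
    """Simplified stand-in for the lookup seen in the traceback (not library code)."""
    full_key = "%s%s" % (prefix or "", key)
    value = data_config.get(full_key)
    if value is None and required:
        raise ValueError("Missing field '%s' in the data configuration" % full_key)
    return value


# Only indexed keys are present, as in the configuration posted above.
data_config = {
    "source_1_vocabulary": "data/vocab/WMT-News.de-en.de.vocab",
    "source_2_vocabulary": "data/vocab/QED.de-en.de.vocab",
}

# A multi-feature source asks for one indexed key per feature.
first_feature_vocab = get_field(
    data_config, "vocabulary", prefix="source_1_", required=True)

# A single source embedder asks for the un-indexed key, which is absent here.
try:
    get_field(data_config, "vocabulary", prefix="source_", required=True)
    error_message = None
except ValueError as error:
    error_message = str(error)

print(error_message)
# prints: Missing field 'source_vocabulary' in the data configuration
```

So with the default Transformer the indexed `source_N_vocabulary` entries are never read, which is why the configuration looks incomplete to the model.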

See for example this model which defines 3 input features that are concatenated:

github.com/OpenNMT/OpenNMT-tf/blob/master/config/models/multi_features_transformer.py

"""Defines a Transformer model with multiple input features. For example, these
could be words, parts of speech, and lemmas that are embedded in parallel and
concatenated into a single input embedding.

The features are separate data files with separate vocabularies. The YAML
configuration file should look like this:

data:
  train_features_file:
    - features_1.txt
    - features_2.txt
    - features_3.txt
  train_labels_file: target.txt
  source_1_vocabulary: feature_1_vocab.txt
  source_2_vocabulary: feature_2_vocab.txt
  source_3_vocabulary: feature_3_vocab.txt
  target_vocabulary: target_vocab.txt
"""

import tensorflow as tf

This file has been truncated.

jalesiyan-hadis · March 3, 2020, 10:28am

My mistake, sorry…
Thank you for the quick response.

guillaumekln · March 3, 2020, 10:31am

Also note that target features are not supported.

jalesiyan-hadis · March 3, 2020, 10:39am

Thank you.
In this case, should I build a single target vocabulary file from all my train_labels_file entries?

guillaumekln · March 3, 2020, 12:59pm

You should only build the vocabulary for the actual target file.
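
As a rough illustration of what building a vocabulary from one tokenized file involves, here is a minimal sketch that counts whitespace-separated tokens. It is not the OpenNMT-tf vocabulary tool: `build_vocab` and its `max_size` parameter are hypothetical names, and the sketch ignores any special tokens the library may expect in its vocabulary files.

```python
from collections import Counter
from typing import Iterable, List, Optional


def build_vocab(lines: Iterable[str], max_size: Optional[int] = None) -> List[str]:
    """Return tokens ordered by descending frequency (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by frequency first, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    tokens = [token for token, _ in ordered]
    return tokens if max_size is None else tokens[:max_size]


sample = ["the cat sat", "the dog sat", "the end"]
vocab = build_vocab(sample)
print(vocab)

# With a real file (the name target.txt is only a placeholder):
# with open("target.txt", encoding="utf-8") as handle:
#     vocab = build_vocab(handle, max_size=32000)
```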

Now that I read your YAML configuration file again, are those training files actually parallel input features? From the names, it looks like they are unrelated training files (WMT, TED, etc.).
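
Parallel input features describe the same examples, so a given line number has to refer to the same sentence in every feature file. A necessary (though not sufficient) check is that all files have the same number of lines. A minimal sketch, with `count_lines` and `same_line_count` as hypothetical helper names:

```python
from typing import Dict, Iterable


def count_lines(paths: Iterable[str]) -> Dict[str, int]:
    """Map each path to its number of lines."""
    counts = {}
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            counts[path] = sum(1 for _ in handle)
    return counts


def same_line_count(counts: Dict[str, int]) -> bool:
    """True when every file has the same number of lines."""
    return len(set(counts.values())) <= 1


# Example usage with two of the files from the configuration above:
# counts = count_lines([
#     "data/tok/WMT-News.de-en.de.tok",
#     "data/tok/QED.de-en.de.tok",
# ])
# print(counts, same_line_count(counts))
```

If the counts differ, the files cannot be parallel features of the same examples.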

jalesiyan-hadis · March 3, 2020, 1:13pm

Actually, I think I misunderstood multi-feature inputs.
I want to use parallel inputs and weight them for training, based on the documentation at https://opennmt.net/OpenNMT-tf/data.html#parallel-inputs
The documentation says:

Parallel inputs require indexed vocabularies
https://opennmt.net/OpenNMT-tf/vocabulary.html#configuring-vocabularies

guillaumekln · March 3, 2020, 1:17pm

“parallel” means “aligned” in this context. Are your files actually aligned?

If not, maybe you are looking for weighted inputs? That is just about interleaving data coming from multiple datasets, and it requires neither a different model architecture nor multiple vocabularies.
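
To illustrate what interleaving means here, the following is a conceptual sketch only, not OpenNMT-tf's implementation: at each step one dataset is picked at random in proportion to its weight, and its next example is emitted. The function name, the weights, and the toy data are made up for illustration. In OpenNMT-tf itself the weights are set in the YAML data configuration; the data documentation describes a `train_files_weights` option for this, but check the name against the docs for your version.

```python
import random
from typing import Dict, Iterator, Sequence, Tuple


def weighted_interleave(
    datasets: Dict[str, Sequence[str]],
    weights: Dict[str, float],
    num_examples: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (dataset name, example) pairs, sampling datasets by weight."""
    rng = random.Random(seed)
    names = list(datasets)
    probabilities = [weights[name] for name in names]
    cursors = {name: 0 for name in names}
    for _ in range(num_examples):
        name = rng.choices(names, weights=probabilities, k=1)[0]
        examples = datasets[name]
        # Wrap around so a small dataset can be repeated.
        yield name, examples[cursors[name] % len(examples)]
        cursors[name] += 1


# Toy data standing in for two unrelated corpora.
toy_datasets = {
    "news": ["news sentence 1", "news sentence 2"],
    "talks": ["talk sentence 1", "talk sentence 2", "talk sentence 3"],
}
toy_weights = {"news": 0.8, "talks": 0.2}

stream = list(weighted_interleave(toy_datasets, toy_weights, num_examples=1000))
picked = [name for name, _ in stream]
print(picked.count("news"), picked.count("talks"))
```

All examples pass through the same model and the same pair of vocabularies; only the mixing ratio changes.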

jalesiyan-hadis · March 3, 2020, 1:27pm

That’s great,
the word “parallel” just confused me.
Thank you for your help

guillaumekln · December 21, 2021, 8:47am

2 posts were split to a new topic: How to run dual source Transformer?