Multiple sources with multiple features

wenzi3241 · March 31, 2020, 7:04pm

Hello Fellow Researchers,

I am trying to run onmt with multiple sources and multiple features. So I want to encode two different sources. For one of the sources, I need to concatenate all features and encode them with one encoder. I am trying to write a custom model custom_model.py like below,

from opennmt import models, inputters, encoders, layers, decoders

def model():
  return models.SequenceToSequence(
      source_inputter=inputters.ParallelInputter(
          [inputters.ParallelInputter(
            [inputters.WordEmbedder(embedding_size=3), 
             inputters.WordEmbedder(embedding_size=23), 
             inputters.WordEmbedder(embedding_size=43),
             inputters.WordEmbedder(embedding_size=3),
             inputters.WordEmbedder(embedding_size=100)
             ], combine_features=True, reducer=layers.ConcatReducer()), # Combine 5 features 
          inputters.WordEmbedder(embedding_size=300)]
      ),
      target_inputter=inputters.WordEmbedder(
          embedding_size=300),
      encoder=encoders.ParallelEncoder([
          encoders.RNNEncoder(1, 300, dropout=0.2),
          encoders.RNNEncoder(2, 300, dropout=0.2, bidirectional=True)],
          outputs_reducer=layers.ConcatReducer(axis=-1)),
      decoder=decoders.AttentionalRNNDecoder(
          num_layers=1,
          num_units=300,
          dropout=0.2))

And the config.yml looks like below,

model_dir: model/
data:
  train_features_file:
    - train_src_1_1.txt
    - train_src_1_2.txt
    - train_src_1_3.txt
    - train_src_1_4.txt
    - train_src_1_5.txt
    - train_src_2.txt
  train_labels_file: train-tgt.txt
  source_1_1_vocabulary: src_1_1_vocab.txt
  source_1_2_vocabulary: src_1_2_vocab.txt
  source_1_3_vocabulary: src_1_3_vocab.txt
  source_1_4_vocabulary: src_1_4_vocab.txt
  source_1_5_vocabulary: src_1_5_vocab.txt
  source_2_vocabulary: src_2_vocab.txt

  eval_features_file:
    - dev_src_1_1.txt
    - dev_src_1_2.txt
    - dev_src_1_3.txt
    - dev_src_1_4.txt
    - dev_src_1_5.txt
    - dev_src_2.txt
  eval_labels_file: dev-tgt.txt

  target_vocabulary: tgt-vocab.txt
params: ....
....

the command to run is,
onmt-main --model custom_model.py --config config/config.yml --auto_config train --with_eval

The error is
Traceback (most recent call last):
File “/anaconda3/envs/OpenNMT-tf/bin/onmt-main”, line 8, in
sys.exit(main())
File “/anaconda3/envs/OpenNMT-tf/lib/python3.7/site-packages/opennmt/bin/main.py”, line 204, in main
checkpoint_path=args.checkpoint_path)
File “/anaconda3/envs/OpenNMT-tf/lib/python3.7/site-packages/opennmt/runner.py”, line 180, in train
evaluator = evaluation.Evaluator.from_config(model, config)
File “/anaconda3/envs/OpenNMT-tf/lib/python3.7/site-packages/opennmt/evaluation.py”, line 166, in from_config
exporter=exporters.make_exporter(eval_config.get(“export_format”, “saved_model”)))
File “/anaconda3/envs/OpenNMT-tf/lib/python3.7/site-packages/opennmt/evaluation.py”, line 99, in init
prefetch_buffer_size=1)
File “/anaconda3/envs/OpenNMT-tf/lib/python3.7/site-packages/opennmt/inputters/inputter.py”, line 491, in make_evaluation_dataset
dataset = self.make_dataset([features_file, labels_file], training=False)
File “/anaconda3/envs/OpenNMT-tf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py”, line 431, in make_dataset
data_file, training=training)
File “/anaconda3/envs/OpenNMT-tf/lib/python3.7/site-packages/opennmt/inputters/inputter.py”, line 274, in make_dataset
dataset = inputter.make_dataset(data, training=training)
File “/anaconda3/envs/OpenNMT-tf/lib/python3.7/site-packages/opennmt/inputters/inputter.py”, line 269, in make_dataset
raise ValueError(“The number of data files must be the same as the number of inputters”)
ValueError: The number of data files must be the same as the number of inputters

And the error is because I have 2 inputters (one parallel and one wordembedder) and 6 data files which don’t match. My question is does onmt supports such kind of model and if not, what classed (encoder or inputter or both) I should overwrite to have this work? I am using the OpenNMT-tf==2.8

Thanks and best regards

guillaumekln · April 1, 2020, 7:13am

Hi,

This kind of architecture is supported.

To solve this particular error, you should replicate the inputter nesting when declaring the input files (train_features_file and eval_features_file). Can you try the following:

data:
  train_features_file:
    - - train_src_1_1.txt
      - train_src_1_2.txt
      - train_src_1_3.txt
      - train_src_1_4.txt
      - train_src_1_5.txt
    - train_src_2.txt