Multimodal feature input for seq2seq

DataHunter · August 4, 2021, 8:08am

Hi,
I’m currently working on a sequence-to-sequence problem using OpenNMT, where each source query comes with categorical data. For example, in a retail scenario, I have “Phone name specification” as the query and an extra category, “Phone”, that should also be passed as input to generate the output sequence. Basically, I need to pass both input sequences to generate the output, but I’m currently not able to do that: I can only create text files containing source and target texts, which allows just a single source. Any idea how I can do this?

For example,
Src Query: “iphone 12x 64 gb size white color”
Category: “Mobile Phone”

Output: “iPhone 12x 64gb (white)”

Here, I want to pass both the src query and the category as input for generating the output, rather than only the src query, but I couldn’t find any relevant tutorials for doing this with OpenNMT. Could you please let me know how I can do this?
Thank you.

ymoslem · August 4, 2021, 3:50pm

Hi Ahan!

This can be done with OpenNMT-tf multi-source inputs.

The problem description here is unclear to me. It seems like something that one can solve with some find-replace operation. Also, I do not see why this is a “multimodal” problem. It is simply a multi-input one. There might be other aspects you did not mention in the post; I am just saying that if this is for a research paper or the like, such subtleties should be clarified.
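
As a rough illustration of what multi-source input data could look like, here is a minimal sketch that assumes each input goes into its own text file, line-aligned with the target file. The helper name `write_parallel_files`, the output directory, and the file names are made up for this example, and tokenization is not covered.

```python
from pathlib import Path


def write_parallel_files(examples, out_dir):
    """Write line-aligned query, category, and target files.

    `examples` is an iterable of (query, category, target) tuples.
    Line i of every file belongs to the same example.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file names, one file per input plus one for the target.
    paths = {
        "query": out_dir / "src1_train.txt",
        "category": out_dir / "src2_train.txt",
        "target": out_dir / "tgt_train.txt",
    }
    columns = {name: [] for name in paths}
    for query, category, target in examples:
        # Collapse whitespace so that each example stays on exactly one line.
        columns["query"].append(" ".join(query.split()))
        columns["category"].append(" ".join(category.split()))
        columns["target"].append(" ".join(target.split()))
    for name, path in paths.items():
        text = "".join(line + "\n" for line in columns[name])
        path.write_text(text, encoding="utf-8")
    return paths


examples = [
    ("iphone 12x 64 gb size white color ", "Mobile Phone", "iPhone 12x 64gb (white)"),
]
paths = write_parallel_files(examples, "qcr_example")
```

The two source files can then be listed together as the training features, with the target file as the labels.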

All the best,
Yasmin

DataHunter · August 8, 2021, 10:02pm

Thank you so much, @ymoslem. This is really helpful! Yes, apologies, I did not share the multimodal aspect of the problem; I just needed a way to pass multiple sources into the model for building a seq2seq model.

DataHunter · August 11, 2021, 1:42pm

I’ve been getting the following error when I try to get an output using the files attached below (data.yaml and custom_model.py). The model has two inputs using ParallelInputter, but I don’t understand what the issue is. Can you tell me whether data.yaml or custom_model.py is missing something?

2021-08-11 13:38:53.036000: I main.py:326] Using parameters:
data:
  eval_features_file:
  - qcr/src1_test.txt
  - qcr/src2_test.txt
  eval_labels_file: qcr/tgt_test.txt
  source_1_vocabulary: qcr/src1.vocab
  source_2_vocabulary: qcr/src2.vocab
  target_vocabulary: qcr/tgt.vocab
  train_features_file:
  - qcr/src1_train.txt
  - qcr/src2_train.txt
  train_labels_file: qcr/tgt_train.txt
eval:
  batch_size: 2
  batch_type: examples
  eval_delay: 3600
  external_evaluators: BLEU
  length_bucket_width: 5
infer:
  batch_size: 2
  batch_type: examples
  length_bucket_width: 5
model_dir: qcr/run/
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 2
  batch_type: examples
  length_bucket_width: 5
train:
  average_last_checkpoints: 8
  batch_size: 2
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 5000
  maximum_features_length:
  - 100
  - 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_checkpoints_steps: 1000
  save_summary_steps: 100

2021-08-11 13:38:53.241000: I inputter.py:309] Initialized source_1 input layer:
2021-08-11 13:38:53.241000: I inputter.py:309]  - vocabulary size: 13001
2021-08-11 13:38:53.241000: I inputter.py:309]  - special tokens: BOS=no, EOS=no
2021-08-11 13:38:53.251000: I inputter.py:309] Initialized source_2 input layer:
2021-08-11 13:38:53.251000: I inputter.py:309]  - vocabulary size: 2499
2021-08-11 13:38:53.251000: I inputter.py:309]  - special tokens: BOS=no, EOS=no
2021-08-11 13:38:53.299000: I inputter.py:309] Initialized target input layer:
2021-08-11 13:38:53.299000: I inputter.py:309]  - vocabulary size: 13001
2021-08-11 13:38:53.299000: I inputter.py:309]  - special tokens: BOS=yes, EOS=yes
2021-08-11 13:38:53.431000: W runner.py:242] No checkpoint to restore in qcr/run/
2021-08-11 13:38:53.433000: W deprecation.py:336] From /opt/conda/lib/python3.7/site-packages/tensorflow/python/summary/summary_iterator.py:31: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
2021-08-11 13:38:54.989000: I main.py:326] Accumulate gradients of 12500 iterations to reach effective batch size of 25000
2021-08-11 13:38:55.032000: I dataset_ops.py:2120] Training on 912418 examples
2021-08-11 13:38:55.941997: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-11 13:38:55.945570: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2299995000 Hz
2021-08-11 13:39:12.030000: I control_flow.py:1225] Number of model parameters: 71712457
2021-08-11 13:39:12.877000: I control_flow.py:1225] Number of model weights: 321 (trainable = 321, non trainable = 0)
2021-08-11 13:39:43.039162: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-11 13:39:43.415667: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-08-11 13:39:43.416777: W ./tensorflow/stream_executor/stream.h:2140] attempting to perform DNN operation using StreamExecutor without DNN support
2021-08-11 13:39:43.419378: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-08-11 13:39:43.420335: W ./tensorflow/stream_executor/stream.h:2140] attempting to perform DNN operation using StreamExecutor without DNN support
2021-08-11 13:39:43.423007: E tensorflow/stream_executor/cuda/cuda_dnn.cc:352] Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-08-11 13:39:43.423975: W ./tensorflow/stream_executor/stream.h:2140] attempting to perform DNN operation using StreamExecutor without DNN support
Traceback (most recent call last):
  File "/opt/conda/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/opennmt/bin/main.py", line 326, in main
    hvd=hvd,
  File "/opt/conda/lib/python3.7/site-packages/opennmt/runner.py", line 281, in train
    moving_average_decay=train_config.get("moving_average_decay"),
  File "/opt/conda/lib/python3.7/site-packages/opennmt/training.py", line 123, in __call__
    dataset, accum_steps=accum_steps, report_steps=report_steps
  File "/opt/conda/lib/python3.7/site-packages/opennmt/training.py", line 260, in _steps
    loss = forward_fn()
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
    result = self._call(*args, **kwds)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3024, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1961, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/opt/conda/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal:  cuDNN launch failure : input shape ([1,2,512,1])
	 [[node dual_source_transformer_1/parallel_encoder_1/self_attention_encoder_2/self_attention_encoder_layer_12/transformer_layer_wrapper_48/layer_norm_52/FusedBatchNormV3_1 (defined at /lib/python3.7/site-packages/opennmt/layers/common.py:128) ]]
	 [[Func/gradients/global_norm/write_summary/summary_cond/then/_328/input/_987/_62]]
  (1) Internal:  cuDNN launch failure : input shape ([1,2,512,1])
	 [[node dual_source_transformer_1/parallel_encoder_1/self_attention_encoder_2/self_attention_encoder_layer_12/transformer_layer_wrapper_48/layer_norm_52/FusedBatchNormV3_1 (defined at /lib/python3.7/site-packages/opennmt/layers/common.py:128) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference__forward_48335]

Function call stack:
_forward -> _forward

data.yaml file:

model_dir: qcr/run/

data:
  train_features_file: 
      - qcr/src1_train.txt
      - qcr/src2_train.txt
  train_labels_file: qcr/tgt_train.txt
  eval_features_file: 
      - qcr/src1_test.txt
      - qcr/src2_test.txt
  eval_labels_file: qcr/tgt_test.txt
  source_1_vocabulary: qcr/src1.vocab
  source_2_vocabulary: qcr/src2.vocab
  target_vocabulary: qcr/tgt.vocab

train:
    batch_size: 2
    save_checkpoints_steps: 1000
    max_step: 5000
eval:
    eval_delay: 3600  # Every 1 hour
    batch_size: 2
    external_evaluators: BLEU
score:
    batch_size: 2
infer:
    batch_size: 2

custom_model.py

import opennmt

from opennmt.utils import misc
import argparse
import logging
import tensorflow as tf
import tensorflow_addons as tfa

tf.get_logger().setLevel(logging.INFO)

class DualSourceTransformer(opennmt.models.Transformer):
    def __init__(self):
        super().__init__(
            source_inputter=opennmt.inputters.ParallelInputter(
                [
                    opennmt.inputters.WordEmbedder(embedding_size=512),
                    opennmt.inputters.WordEmbedder(embedding_size=512),
                ]
            ),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
            num_layers=6,
            num_units=512,
            num_heads=8,
            ffn_inner_dim=2048,
            dropout=0.1,
            attention_dropout=0.1,
            ffn_dropout=0.1,
            share_encoders=True,
        )

    def auto_config(self, num_replicas=1):
        config = super().auto_config(num_replicas=num_replicas)
        max_length = config["train"]["maximum_features_length"]
        return misc.merge_dict(
            config, {"train": {"maximum_features_length": [max_length, max_length]}}
        )


model = DualSourceTransformer
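
Since the two feature files and the labels file in data.yaml are read in parallel, one sanity check that may be worth running before training is that they all have the same number of lines. This is only a sketch: `count_lines` and `check_aligned` are hypothetical helpers, not part of OpenNMT-tf.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_aligned(feature_files, labels_file):
    """Return the shared line count, or raise ValueError if counts differ."""
    counts = {str(p): count_lines(p) for p in [*feature_files, labels_file]}
    if len(set(counts.values())) != 1:
        raise ValueError(f"Line counts differ: {counts}")
    return next(iter(counts.values()))


# Example call with the paths from data.yaml (not executed here):
# check_aligned(["qcr/src1_train.txt", "qcr/src2_train.txt"], "qcr/tgt_train.txt")
```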

ymoslem · August 11, 2021, 6:08pm

Dear Anan,

The traceback seems to have multiple CUDA issues.

What do you get when you run this line?

echo $CUDA_VISIBLE_DEVICES

If there is nothing, activate your GPU. Here I have 2 GPUs, so my command is:

export CUDA_VISIBLE_DEVICES=0,1

Are you able to run a regular (not custom) training on GPU, e.g. the QuickStart? To use GPUs, you have to add --num_gpus 2 for two GPUs, or whatever you have. For example:

onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval --num_gpus 2

I suggest you make sure the QuickStart works for you on a GPU before running the custom training.
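
One caveat when reading the output of the echo command: as far as I know, it prints an empty line both when CUDA_VISIBLE_DEVICES is unset and when it is set to an empty string, and CUDA treats the two cases differently. The sketch below (the function name is made up) only inspects the environment variable; it does not query the driver or TensorFlow.

```python
import os


def describe_cuda_visible_devices(environ=None):
    """Explain how CUDA_VISIBLE_DEVICES restricts GPU visibility."""
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Unset: CUDA applies no restriction.
        return "unset: all GPUs on the machine are visible"
    if value.strip() == "":
        # Set but empty: every GPU is hidden from the process.
        return "empty: no GPU is visible"
    ids = [item.strip() for item in value.split(",") if item.strip()]
    return "restricted to device(s): " + ", ".join(ids)


print(describe_cuda_visible_devices())
```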

I am adding @guillaumekln for better advice.

Kind regards,
Yasmin

DataHunter · August 11, 2021, 11:06pm

Hi @ymoslem and @guillaumekln ,
I figured out earlier that the issue was a cuDNN library incompatibility, so I upgraded from cudnn==8.0.5 to cudnn==8.1.0, which fixed the previous error. The model now reports the number of parameters but freezes after that. I have been unable to move beyond this:

2021-08-11 14:14:57.024000: I inputter.py:309] Initialized source_1 input layer:
2021-08-11 14:14:57.024000: I inputter.py:309]  - vocabulary size: 13001
2021-08-11 14:14:57.025000: I inputter.py:309]  - special tokens: BOS=no, EOS=no
2021-08-11 14:14:57.034000: I inputter.py:309] Initialized source_2 input layer:
2021-08-11 14:14:57.034000: I inputter.py:309]  - vocabulary size: 2499
2021-08-11 14:14:57.034000: I inputter.py:309]  - special tokens: BOS=no, EOS=no
2021-08-11 14:14:57.078000: I inputter.py:309] Initialized target input layer:
2021-08-11 14:14:57.078000: I inputter.py:309]  - vocabulary size: 13001
2021-08-11 14:14:57.078000: I inputter.py:309]  - special tokens: BOS=yes, EOS=yes
2021-08-11 14:14:57.210000: W runner.py:242] No checkpoint to restore in qcr/run/
2021-08-11 14:14:57.212000: W deprecation.py:336] From /opt/conda/lib/python3.7/site-packages/tensorflow/python/summary/summary_iterator.py:31: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: 
`tf.data.TFRecordDataset(path)`
2021-08-11 14:14:58.663000: I main.py:326] Accumulate gradients of 12500 iterations to reach effective batch size of 25000
2021-08-11 14:14:58.706000: I dataset_ops.py:2120] Training on 912418 examples
2021-08-11 14:14:59.585849: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-08-11 14:14:59.589917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2299995000 Hz
2021-08-11 14:15:14.985000: I control_flow.py:1225] Number of model parameters: 71712457
2021-08-11 14:15:15.722000: I control_flow.py:1225] Number of model weights: 321 (trainable = 321, non trainable = 0)
2021-08-11 14:15:44.426211: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-08-11 14:15:44.879476: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8201
2021-08-11 14:15:44.956435: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-08-11 14:15:45.341955: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11

After this, nothing more is printed, and I am stuck trying to debug what the issue is. I have my GPU configured as you described earlier; I’m currently using a Tesla V100 GPU with 16GB of memory, and CUDA seems to be configured as well.

But when I run echo $CUDA_VISIBLE_DEVICES, I get nothing! Do you think this could be the reason my code isn’t training after that step?

DataHunter · August 11, 2021, 11:17pm

I was able to move a step forward after setting the variable and running the command below:
export CUDA_VISIBLE_DEVICES=0

!CUDA_VISIBLE_DEVICES=0 onmt-main --auto_config --config data.yaml --model custom_model.py train --with_eval --num_gpus=1

But I am still stuck at the same point as described above.

ymoslem · August 11, 2021, 11:56pm

You do not have errors any more, which is good. Please change batch_size to either some larger number like 1024, or 0 to let the model decide the largest number possible, and wait a few minutes to see what you get.
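
The reason can be read from the log line “Accumulate gradients of 12500 iterations to reach effective batch size of 25000”. With batch_size: 2 (counted in tokens, since the logged batch_type is tokens), a single parameter update needs 12500 forward and backward passes, so the first progress report may take a very long time and the training looks frozen. Here is a small sketch of that arithmetic. It assumes the accumulation count is the effective batch size divided by the per-step batch size times the number of replicas, rounded up. This matches the log above, but it is my reading rather than a quote of the OpenNMT-tf source, and the function name is made up.

```python
import math


def accumulation_steps(effective_batch_size, batch_size, num_replicas=1):
    """Number of gradient accumulation steps per parameter update (assumed formula)."""
    return math.ceil(effective_batch_size / (batch_size * num_replicas))


print(accumulation_steps(25000, 2))     # 12500, as in the log above
print(accumulation_steps(25000, 1024))  # 25
```

With a batch size of 1024, each update would need 25 accumulation steps instead of 12500 under this assumption.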