Multifeature translation question

Hi,

I would like to implement and try a model similar to the one described here:

I am planning to use POS and Named Entity labels as additional features for translation between two languages. However, I have a few gaps in my understanding, as I am quite new to this field, and I hope some researchers could help me think about it in the right way.

  1. The encoder here takes all the additional features using a ParallelInputter, meaning that I need to supply the features alongside my source-language sentences. How do I make use of similar additional features for my target-language sentences as well? For example, how can I make the model learn relationships between source and target POS tag sequences? And how can I help the decoder narrow down its search using such features?

  2. When I run inference with such a trained model, is it necessary to input the POS tag sequence and the NE sequence along with the source sentence?

In your case, you will have 3 parallel files: one for the text, one for the POS tags, and one for the named entity labels. The way to define them in the configuration file is shown here: http://opennmt.net/OpenNMT-tf/data.html#parallel-inputs
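A practical note on these parallel files: line i of every file must describe the same sentence, token for token, so the token counts per line have to match across files. A tiny sanity check (a hypothetical helper, not part of OpenNMT-tf) could look like this:

```python
# Check that parallel feature files are token-aligned: line i of every
# file must contain the same number of space-separated tokens.
def check_alignment(streams):
    """streams: one list of sentence strings per parallel file."""
    for i, lines in enumerate(zip(*streams), start=1):
        counts = [len(line.split()) for line in lines]
        if len(set(counts)) != 1:
            raise ValueError("line %d: token counts differ: %s" % (i, counts))
    return True

words = ["The cat sleeps"]
pos = ["DET NOUN VERB"]
ner = ["O O O"]
check_alignment([words, pos, ner])  # 3 tokens per line in every file
```

The same check can be run over the real files by reading each one into a list of lines first.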

Additional features on the target side are not supported.

Yes, it is required.

Thank you very much. I will try to work with this.

Hi… Do I also need to give the vocabularies of the respective additional features in the yml file, in addition to source_words_vocabulary and target_words_vocabulary?

I tried setting feature_1_vocabulary: path/to/my/feature_1_vocabulary.txt, yet I seem to get a NoneType error in text_inputter.py. Could you please help me with this?

Yes.

Could you post your model definition and configuration file?

This is my yml file:

model_dir: C:/Users/Me/kantelmulti/
data:
train_features_file:
- C:/Users/Me/kantrainsentences.txt
- C:/Users/Me/kantraintagset.txt
- C:/Users/Me/kantraingender.txt
- C:/Users/Me/kantrainnumber.txt
- C:/Users/Me/kantelmulti/kantrainperson.txt
train_labels_file:
- C:/Users/Me/kantelmulti/teltrainnew.txt
eval_features_file:
- C:/Users/Me/kantelmulti/kanvalsentences.txt
- C:/Users/Me/kantelmulti/kanvaltagset.txt
- C:/Users/Me//kantelmulti/kanvalgender.txt
- C:/Users/Me/kantelmulti/kanvalnumber.txt
- C:/Users/Me/kantelmulti/kanvalperson.txt
eval_labels_file:
- C:/Users/Me/kantelmulti/telvalnew.txt
source_words_vocabulary:
- C:/Users/Me/kantelmulti/kanvocab.txt
target_words_vocabulary:
- C:/Users/Me/kantelmulti/telvocab.txt
feature_1_vocabulary:
- C:/Users/Me/kantelmulti/tagvocab.txt
feature_2_vocabulary:
- C:/Users/Me/kantelmulti/gen.txt
feature_3_vocabulary:
- C:/Users/Me/kantelmulti/numvocab.txt
feature_4_vocabulary:
- C:/Users/Me/kantelmulti/pervocab.txt
train:
batch_size: 64
train_steps: 10000

And this is my model - multi.py

"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""

import tensorflow as tf
import opennmt as onmt

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.ParallelInputter([
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="source_words_vocabulary",
              embedding_size=512),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_1_vocabulary",
              embedding_size=64),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_2_vocabulary",
              embedding_size=16),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_3_vocabulary",
              embedding_size=16),  
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_4_vocabulary",
              embedding_size=16)],
          reducer=onmt.layers.ConcatReducer()),
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=512),
      encoder=onmt.encoders.BidirectionalRNNEncoder(
          num_layers=4,
          num_units=512,
          reducer=onmt.layers.ConcatReducer(),
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=4,
          num_units=512,
          bridge=onmt.layers.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False))

Earlier, I got an error about alignment_file_key being NoneType. I checked sequence_to_sequence.py in the models directory, which was throwing this error, and saw that it was looking for an alignment file “train-align”, which does not exist on my system.

I was not planning to use alignments, so I just set alignment_file_key=None on line 85 of sequence_to_sequence.py, which got rid of the error. Was that a mistake? Does the vocabulary file depend on this? Please tell me what I need to do.

This was the error I kept getting when alignment_file_key was not None:

Traceback (most recent call last):
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Aditya\Anaconda3\envs\tf_gpu\Scripts\onmt-main.exe\__main__.py", line 9, in <module>
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\bin\main.py", line 169, in main
    hvd=hvd)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\runner.py", line 96, in __init__
    self._model.initialize(self._config["data"])
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\models\model.py", line 70, in initialize
    self.examples_inputter.initialize(metadata)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\models\sequence_to_sequence.py", line 373, in initialize
    if self.alignment_file_key is not None and self.alignment_file_key in metadata:
TypeError: argument of type 'NoneType' is not iterable

Your YAML file is invalid:

  • indentation matters: you should indent everything that is under data and train
  • labels and vocabulary fields don’t take a list, i.e. you should just set the value like this:
train_labels_file: C:/Users/Me/kantelmulti/teltrainnew.txt

https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
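Applied to the file posted above, the data section would look roughly like this (paths copied from the original post; eval files follow the same pattern as the train files):

```yaml
model_dir: C:/Users/Me/kantelmulti/

data:
  train_features_file:
    - C:/Users/Me/kantrainsentences.txt
    - C:/Users/Me/kantraintagset.txt
    - C:/Users/Me/kantraingender.txt
    - C:/Users/Me/kantrainnumber.txt
    - C:/Users/Me/kantelmulti/kantrainperson.txt
  train_labels_file: C:/Users/Me/kantelmulti/teltrainnew.txt
  source_words_vocabulary: C:/Users/Me/kantelmulti/kanvocab.txt
  target_words_vocabulary: C:/Users/Me/kantelmulti/telvocab.txt
  feature_1_vocabulary: C:/Users/Me/kantelmulti/tagvocab.txt
  feature_2_vocabulary: C:/Users/Me/kantelmulti/gen.txt
  feature_3_vocabulary: C:/Users/Me/kantelmulti/numvocab.txt
  feature_4_vocabulary: C:/Users/Me/kantelmulti/pervocab.txt

train:
  batch_size: 64
  train_steps: 10000
```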

Thanks a lot. Greatly appreciated. I am able to get rid of the error now.

Is it now okay if I leave alignment_file_key set to None instead of the default value “train-alignments” in /models/sequence_to_sequence.py?

You don’t need to set or change this argument.

Thank you… I seem to have another issue with embedding sizes. I have set both the source and target word embedding sizes to 512. On the source-side ParallelInputter, I have set the feature_1 embedding size to 64, feature_2 to 64, feature_3 to 16, and feature_4 to 16.

I seem to run into an error saying:

Assign requires shapes of both tensors to match lhs shape=[928,1024] rhs shape=[880,1024]

I am confused about how these tensor sizes are calculated. How do I set them correctly? Kindly reply.

When do you get this error?

Whenever I try to train using the model file with the embedding sizes I mentioned above:

onmt-main train --model /path/multi.py --config myconfig.yml --auto_config

My model file is below:

"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""

import tensorflow as tf
import opennmt as onmt

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.ParallelInputter([
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="source_words_vocabulary",
              embedding_size=512),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_1_vocabulary",
              embedding_size=64),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_2_vocabulary",
              embedding_size=64),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_3_vocabulary",
              embedding_size=16),  
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_4_vocabulary",
              embedding_size=16)],
          reducer=onmt.layers.ConcatReducer()),
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=512),
      encoder=onmt.encoders.BidirectionalRNNEncoder(
          num_layers=4,
          num_units=512,
          reducer=onmt.layers.ConcatReducer(),
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=4,
          num_units=512,
          bridge=onmt.layers.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False))

Hello… I am still unable to figure out the proper way to set the embedding sizes. I have tried multiple combinations, and every one resulted in an error involving the tensor shapes. Could you please help me out…

I think you started a training run and then changed the embedding sizes. You should either delete the existing checkpoint or keep the same embedding sizes.
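The reported shapes are consistent with this: 928 - 880 = 48, which is exactly the change made to the feature_2 embedding (64 - 16) between the two model files. A rough sketch of the arithmetic, under the assumptions that the bidirectional encoder splits its 512 units into 256 per direction (ConcatReducer) and that the first LSTM kernel has shape [input_depth + units, 4 * units]:

```python
# Sketch of the checkpoint shape mismatch. The kernel layout below is an
# assumption about tf.contrib.rnn.LSTMCell: [input_depth + units, 4 * units].
units_per_direction = 256  # 512 encoder units split over two directions

old_input = 512 + 64 + 16 + 16 + 16  # embedding sizes in the first model file
new_input = 512 + 64 + 64 + 16 + 16  # feature_2 raised from 16 to 64

saved_rows = old_input + units_per_direction     # rows saved in the checkpoint
expected_rows = new_input + units_per_direction  # rows the new graph expects

print(saved_rows, expected_rows)  # 880 928, matching the reported shapes
```

So the checkpoint in model_dir was written with the old embedding sizes, and the new graph no longer fits it; deleting the checkpoint (or reverting the sizes) resolves the conflict.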

@sanadi1209 @guillaumekln
Can you provide details on how you generated these files? Does every word in the tagged file need to have a separate tag indicating its gender (or not), number (or not), etc.? (A snapshot or example would be great.)

Thanks!

Hi… Sorry for the late response. I have been using the TDIL corpus. Some segments of the corpus are already tagged with POS tags and other information. I just parsed these tags and put the word sequences in one file and the corresponding POS tag sequences in another; similarly for the gender tags (m-f-n for masculine, feminine, and neuter). This is the method I followed for the question above.
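For illustration, the parsing step described above can be sketched as follows. The "word|TAG" token format and the helper name are assumptions for the example; the actual TDIL annotation scheme may differ:

```python
# Hypothetical sketch: split one POS-annotated line of the form
# "word1|TAG1 word2|TAG2 ..." into a word sequence and a tag sequence,
# to be written to the two parallel files.
def split_tagged_line(line, sep="|"):
    words, tags = [], []
    for token in line.split():
        word, tag = token.rsplit(sep, 1)
        words.append(word)
        tags.append(tag)
    return " ".join(words), " ".join(tags)

words, tags = split_tagged_line("the|DET cat|NOUN sleeps|VERB")
# words == "the cat sleeps", tags == "DET NOUN VERB"
```

Running this over every line of the tagged corpus and writing the two return values to separate files yields the token-aligned parallel files the model expects.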