Multifeature translation question

sanadi1209 · August 17, 2019, 2:03pm

Hi,

I am curious to implement and try a model similar to the one given here:

OpenNMT/OpenNMT-tf/blob/master/config/models/multi_features_nmt.py

"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""

import tensorflow as tf
import opennmt as onmt

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.ParallelInputter([
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="source_words_vocabulary",
              embedding_size=512),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_1_vocabulary",
              embedding_size=16),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_2_vocabulary",

This file has been truncated.

I am planning to use POS and Named Entity labels as additional features for translation between two languages. However, I have a few gaps in my thought process as I am quite new to this field. I hope some researchers could help me think about this in the right way.

The encoder here takes all the additional features through a ParallelInputter, which means I need to supply the features alongside my source language sentences. How do I make use of similar additional features for my target language sentences as well? For example, how can I make the model learn relationships between source and target POS tag sequences, and how can I help the decoder narrow down its search using such features?
When I run inference with such a trained model, is it necessary to provide the POS tag sequence and the NE sequence along with the source sentence?

guillaumekln · August 17, 2019, 2:12pm

In your case, you will have 3 parallel files: one for the text, one for the POS tags, and one for the named entity labels. The way to define them in the configuration file is showcased here: http://opennmt.net/OpenNMT-tf/data.html#parallel-inputs

Additional features on the target side are not supported.

Yes, it is required.

sanadi1209 · August 18, 2019, 3:46pm

Thank you very much. I will try to work with this.

sanadi1209 · August 19, 2019, 4:18pm

Hi… Do I also need to give the vocabularies of the respective additional features in the yml file, in addition to source_words_vocabulary and target_words_vocabulary?

I tried giving feature_1_vocabulary: path/to/my/feature_1_vocabulary.txt, yet I seem to get a NoneType error in text_inputter.py. Could you please help me with this?

guillaumekln · August 19, 2019, 5:16pm

Yes.

Could you post your model definition and configuration file?

sanadi1209 · August 19, 2019, 5:27pm

This is my yml file:

model_dir: C:/Users/Me/kantelmulti/
data:
train_features_file:
- C:/Users/Me/kantrainsentences.txt
- C:/Users/Me/kantraintagset.txt
- C:/Users/Me/kantraingender.txt
- C:/Users/Me/kantrainnumber.txt
- C:/Users/Me/kantelmulti/kantrainperson.txt
train_labels_file:
- C:/Users/Me/kantelmulti/teltrainnew.txt
eval_features_file:
- C:/Users/Me/kantelmulti/kanvalsentences.txt
- C:/Users/Me/kantelmulti/kanvaltagset.txt
- C:/Users/Me//kantelmulti/kanvalgender.txt
- C:/Users/Me/kantelmulti/kanvalnumber.txt
- C:/Users/Me/kantelmulti/kanvalperson.txt
eval_labels_file:
- C:/Users/Me/kantelmulti/telvalnew.txt
source_words_vocabulary:
- C:/Users/Me/kantelmulti/kanvocab.txt
target_words_vocabulary:
- C:/Users/Me/kantelmulti/telvocab.txt
feature_1_vocabulary:
- C:/Users/Me/kantelmulti/tagvocab.txt
feature_2_vocabulary:
- C:/Users/Me/kantelmulti/gen.txt
feature_3_vocabulary:
- C:/Users/Me/kantelmulti/numvocab.txt
feature_4_vocabulary:
- C:/Users/Me/kantelmulti/pervocab.txt
train:
batch_size: 64
train_steps: 10000

sanadi1209 · August 19, 2019, 5:27pm

And this is my model - multi.py

"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""

import tensorflow as tf
import opennmt as onmt

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.ParallelInputter([
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="source_words_vocabulary",
              embedding_size=512),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_1_vocabulary",
              embedding_size=64),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_2_vocabulary",
              embedding_size=16),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_3_vocabulary",
              embedding_size=16),  
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_4_vocabulary",
              embedding_size=16)],
          reducer=onmt.layers.ConcatReducer()),
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=512),
      encoder=onmt.encoders.BidirectionalRNNEncoder(
          num_layers=4,
          num_units=512,
          reducer=onmt.layers.ConcatReducer(),
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=4,
          num_units=512,
          bridge=onmt.layers.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False))

sanadi1209 · August 19, 2019, 5:32pm

Earlier, I got an error about alignment_file_key being NoneType. I checked sequence_to_sequence.py in the models directory, which was throwing this error, and saw that it was expecting an alignment file “train-align” which is not on my system.

I was not planning to use alignments so I just made alignment_file_key=None in line 85 of sequence_to_sequence.py , which got rid of the error. Was it a mistake? Is the vocabulary file dependent on this? Please tell me what I need to do.

sanadi1209 · August 19, 2019, 6:33pm

This was the error I kept getting when alignment_file_key was not None:

Traceback (most recent call last):
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Aditya\Anaconda3\envs\tf_gpu\Scripts\onmt-main.exe\__main__.py", line 9, in <module>
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\bin\main.py", line 169, in main
    hvd=hvd)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\runner.py", line 96, in __init__
    self._model.initialize(self._config["data"])
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\models\model.py", line 70, in initialize
    self.examples_inputter.initialize(metadata)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\models\sequence_to_sequence.py", line 373, in initialize
    if self.alignment_file_key is not None and self.alignment_file_key in metadata:
TypeError: argument of type 'NoneType' is not iterable

guillaumekln · August 20, 2019, 7:29am

Your YAML file is invalid:

Indentation matters: you should indent everything that is under data and train.
Labels and vocabulary fields don’t take a list, i.e. you should just set the value like this:

train_labels_file: C:/Users/Me/kantelmulti/teltrainnew.txt

https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
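
To make the two points concrete, here is a small sketch in Python of what the loader sees. It assumes the YAML has already been parsed into a dict by a standard YAML loader, where a key with nothing indented under it comes back as None. `check_data_section` is a hypothetical helper written for this example, not part of OpenNMT-tf, and the configs are shortened.

```python
# Hypothetical helper, not part of OpenNMT-tf: sanity-check a parsed config.

# Assumed result of parsing the flat (unindented) file: "data" has no
# children, and the file entries end up as top-level keys.
flat_config = {
    "model_dir": "C:/Users/Me/kantelmulti/",
    "data": None,
    "train_labels_file": ["C:/Users/Me/kantelmulti/teltrainnew.txt"],
    "train": None,
}

# The intended structure: entries nested under "data", labels as a plain string.
nested_config = {
    "model_dir": "C:/Users/Me/kantelmulti/",
    "data": {
        # Parallel inputs: the features file is a list, one path per feature.
        "train_features_file": [
            "C:/Users/Me/kantrainsentences.txt",
            "C:/Users/Me/kantraintagset.txt",
        ],
        "train_labels_file": "C:/Users/Me/kantelmulti/teltrainnew.txt",
        "source_words_vocabulary": "C:/Users/Me/kantelmulti/kanvocab.txt",
    },
}


def check_data_section(config):
    """Return a list of human-readable problems found in config["data"]."""
    data = config.get("data")
    if not isinstance(data, dict):
        return ["'data' has no nested keys; indent the entries under it"]
    problems = []
    for key, value in data.items():
        # Labels and vocabularies are single paths, not lists.
        if key.endswith(("labels_file", "vocabulary")) and not isinstance(value, str):
            problems.append(f"'{key}' should be a single path, not a list")
    return problems


print(check_data_section(flat_config))
print(check_data_section(nested_config))
```

The first call reports the missing nesting and the second returns an empty list. Only the features files are lists here, because those are the parallel inputs.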

sanadi1209 · August 20, 2019, 8:51am

Thanks a lot. Greatly appreciated. I am able to get rid of the error now.

Is it now okay if I set alignment_file_key to None instead of leaving it at the default value “train-alignments” in /models/sequence_to_sequence.py?

guillaumekln · August 20, 2019, 7:04pm

You don’t need to set or change this argument.
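
For context on why editing the library appeared to help: the condition shown in the traceback evaluates `self.alignment_file_key in metadata`, and that raises a TypeError when `metadata` (the `data` section of the configuration) is itself None. Setting the key to None only short-circuits the check; it does not supply the missing data section. A minimal reproduction in plain Python follows, where `has_alignments`, `describe`, and the key string are illustrative names, not the library's:

```python
def has_alignments(alignment_file_key, metadata):
    # Same shape as the condition shown in the traceback.
    return alignment_file_key is not None and alignment_file_key in metadata


def describe(alignment_file_key, metadata):
    """Run the check and report either its result or the TypeError it raises."""
    try:
        return str(has_alignments(alignment_file_key, metadata))
    except TypeError as error:
        return f"TypeError: {error}"


# "alignments_key" is a placeholder string, not the library default.
print(describe("alignments_key", None))  # metadata missing: the membership test fails
print(describe(None, None))              # key is None: the test is skipped entirely
print(describe("alignments_key", {}))    # metadata is a dict: the check works
```

With a valid data section the default key is harmless, since the check simply evaluates to False when no alignment file is configured.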

sanadi1209 · August 21, 2019, 3:32pm

Thank you… I seem to have another issue with embedding sizes. I have set both the source and target word embedding sizes to 512. In the source-side ParallelInputter, I have set the embedding sizes to 64 for feature 1, 64 for feature 2, 16 for feature 3, and 16 for feature 4.

I seem to run into an error saying:

Assign requires shapes of both tensors to match lhs shape=[928,1024] rhs shape=[880,1024]

I am confused as to how these tensor sizes are calculated. How do I set these correctly? Kindly reply.

guillaumekln · August 21, 2019, 7:10pm

When do you get this error?

sanadi1209 · August 22, 2019, 6:03am

Whenever I try to train using the model file with the embedding sizes I mentioned above.

onmt-main train --model /path/multi.py --config myconfig.yml --auto_config

My model file is below:

"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""

import tensorflow as tf
import opennmt as onmt

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.ParallelInputter([
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="source_words_vocabulary",
              embedding_size=512),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_1_vocabulary",
              embedding_size=64),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_2_vocabulary",
              embedding_size=64),
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_3_vocabulary",
              embedding_size=16),  
          onmt.inputters.WordEmbedder(
              vocabulary_file_key="feature_4_vocabulary",
              embedding_size=16)],
          reducer=onmt.layers.ConcatReducer()),
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=512),
      encoder=onmt.encoders.BidirectionalRNNEncoder(
          num_layers=4,
          num_units=512,
          reducer=onmt.layers.ConcatReducer(),
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=4,
          num_units=512,
          bridge=onmt.layers.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False))

sanadi1209 · August 22, 2019, 6:45pm

Hello… I am still unable to figure out the proper way to set the embedding sizes. I have tried multiple combinations and every one resulted in an error involving the tensor shapes. Could you please help me out…

guillaumekln · August 23, 2019, 6:54pm

I think you started a training and then changed the embedding size. You should either delete the existing checkpoint or set the same embedding size.
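
The two shapes in the error message are consistent with this. As a rough sketch, assume the bidirectional encoder gives each direction half of num_units (256 here) and that a standard LSTM cell stores one kernel of shape [input_depth + units, 4 * units]. The first layer's kernel then depends on the sum of the embedding sizes. `lstm_kernel_shape` is an illustrative helper, not an OpenNMT-tf function:

```python
def lstm_kernel_shape(embedding_sizes, num_units=512, bidirectional=True):
    """Shape of the first-layer LSTM kernel fed by concatenated embeddings."""
    # ConcatReducer on the inputter: feature embeddings are concatenated in depth.
    input_depth = sum(embedding_sizes)
    # Assumption: each direction gets half of num_units when outputs are concatenated.
    units = num_units // 2 if bidirectional else num_units
    # A standard LSTM kernel maps [inputs, previous state] to the 4 gates.
    return [input_depth + units, 4 * units]


first_model = lstm_kernel_shape([512, 64, 16, 16, 16])   # sizes in the first multi.py
second_model = lstm_kernel_shape([512, 64, 64, 16, 16])  # sizes in the later multi.py
print(first_model, second_model)  # [880, 1024] [928, 1024]
```

Under these assumptions, 880 matches the checkpoint written with the earlier sizes and 928 matches the current model, so restoring fails until the old checkpoint is removed from model_dir or the sizes are changed back.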

mayub · February 12, 2020, 10:39pm

@sanadi1209 @guillaumekln
Can you explain how you generated these files?
Does every word in the tagged file need a separate tag stating its gender (or lack of one), number (or lack of one), etc.? (A snapshot or example would be great.)

Thanks!

sanadi1209 · March 30, 2020, 11:11am

Hi… Sorry for the late response. I have been using the TDIL corpus. Some segments of the corpus have already been tagged with the POS tags and other info. I just parsed these tags and included word sequences in one file and corresponding POS tag sequences in another. Similarly for the gender tags (m-f-n for male, female and neutral). This was the method I followed in the question here.
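
To illustrate the layout (not the actual TDIL format, which is not shown in this thread): assume each token carries its tags as `word|POS|gender`, a hypothetical format chosen only for this example. The parsing step then amounts to producing one line per sentence for each parallel file, with exactly one tag per word so the files stay aligned token by token. `split_features` and the sample sentence are made up for the sketch.

```python
def split_features(tagged_sentences, num_features=3, separator="|"):
    """Split 'word|POS|gender' tokens into parallel lines, one list per feature."""
    columns = [[] for _ in range(num_features)]
    for sentence in tagged_sentences:
        fields = [token.split(separator) for token in sentence.split()]
        for token_fields in fields:
            # Every token needs a value for every feature, or alignment is lost.
            if len(token_fields) != num_features:
                raise ValueError(f"expected {num_features} fields, got {token_fields}")
        for index, column in enumerate(columns):
            column.append(" ".join(token_fields[index] for token_fields in fields))
    return columns


# Made-up sentence; using "n" for tokens without gender is an assumption here.
words, pos_tags, genders = split_features(["she|PRP|f reads|VB|n books|NNS|n"])
print(words)     # ['she reads books']
print(pos_tags)  # ['PRP VB NNS']
print(genders)   # ['f n n']
```

Each returned list would then be written to its own file, one sentence per line, and the files listed in the same order as the inputters in the model definition.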