I am curious to implement and try a model similar to the one given here:
I am planning to use POS and named entity labels as additional features for translation between two languages. However, I have a few gaps in my thought process, as I am quite new to this field. I hope some researchers here can help me think about this in the right way.
The encoder here takes all the additional features using a ParallelInputter, meaning that I need to add the features to my source language sentences. How do I make use of similar additional features for my target language sentences as well? Like, how can I make the model learn relationships between various source and target POS tag sequences? How can I help the decoder narrow down its search by making use of the possible features in this case?
When I run inference with such a trained model, is it necessary to input the POS tag sequence and the NE sequence along with the source sentence?
In your case, you will have 3 parallel files: one for the text, one for the POS tags, and one for the named entity labels. The way to define them in the configuration file is showcased here: http://opennmt.net/OpenNMT-tf/data.html#parallel-inputs
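A data block along these lines should work; the file names and paths below are placeholders, and the feature vocabulary keys must match the `vocabulary_file_key` values used in the model definition:

```yaml
model_dir: run/

data:
  train_features_file:
    - data/train.src.txt     # source sentences
    - data/train.pos.txt     # POS tag sequences, one tag per source token
    - data/train.ne.txt      # named entity label sequences
  train_labels_file: data/train.tgt.txt
  source_words_vocabulary: data/src-vocab.txt
  feature_1_vocabulary: data/pos-vocab.txt
  feature_2_vocabulary: data/ne-vocab.txt
  target_words_vocabulary: data/tgt-vocab.txt
```

Each feature file must have exactly one entry per token of the corresponding source line.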
Additional features on the target side are not supported.
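For inference, you pass the feature files alongside the source file in the same order as the inputters, as described in the parallel-inputs documentation linked above (the file names below are placeholders):

```shell
onmt-main infer --config config.yml \
    --features_file test.src.txt test.pos.txt test.ne.txt \
    --predictions_file predictions.txt
```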
Hi… Do I also need to give the vocabularies of the respective additional features in the YAML file, in addition to source_words_vocabulary and target_words_vocabulary?
I tried giving feature_1_vocabulary: path/to/my/feature_1_vocabulary.txt, yet I seem to get a NoneType error in text_inputter.py. Could you please help me with this?
"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""
import tensorflow as tf
import opennmt as onmt
def model():
return onmt.models.SequenceToSequence(
source_inputter=onmt.inputters.ParallelInputter([
onmt.inputters.WordEmbedder(
vocabulary_file_key="source_words_vocabulary",
embedding_size=512),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_1_vocabulary",
embedding_size=64),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_2_vocabulary",
embedding_size=16),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_3_vocabulary",
embedding_size=16),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_4_vocabulary",
embedding_size=16)],
reducer=onmt.layers.ConcatReducer()),
target_inputter=onmt.inputters.WordEmbedder(
vocabulary_file_key="target_words_vocabulary",
embedding_size=512),
encoder=onmt.encoders.BidirectionalRNNEncoder(
num_layers=4,
num_units=512,
reducer=onmt.layers.ConcatReducer(),
cell_class=tf.contrib.rnn.LSTMCell,
dropout=0.3,
residual_connections=False),
decoder=onmt.decoders.AttentionalRNNDecoder(
num_layers=4,
num_units=512,
bridge=onmt.layers.CopyBridge(),
attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
cell_class=tf.contrib.rnn.LSTMCell,
dropout=0.3,
residual_connections=False))
Earlier, I got an error about alignment_file_key being NoneType. I checked sequence_to_sequence.py in the models directory, which was throwing this error, and saw that it was looking for an alignment file “train-align” which is not on my system.
I was not planning to use alignments, so I just set alignment_file_key=None on line 85 of sequence_to_sequence.py, which got rid of the error. Was that a mistake? Does the vocabulary file depend on this? Please tell me what I need to do.
This was the error I kept getting when alignment_file_key was not None:
Traceback (most recent call last):
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Aditya\Anaconda3\envs\tf_gpu\Scripts\onmt-main.exe\__main__.py", line 9, in <module>
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\bin\main.py", line 169, in main
    hvd=hvd)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\runner.py", line 96, in __init__
    self._model.initialize(self._config["data"])
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\models\model.py", line 70, in initialize
    self.examples_inputter.initialize(metadata)
  File "c:\users\aditya\anaconda3\envs\tf_gpu\lib\site-packages\opennmt\models\sequence_to_sequence.py", line 373, in initialize
    if self.alignment_file_key is not None and self.alignment_file_key in metadata:
TypeError: argument of type 'NoneType' is not iterable
Thank you… I seem to have another issue with embedding sizes. I have set both the source and target word embedding sizes to 512. In the source-side ParallelInputter, I have set the embedding sizes of feature 1 to 64, feature 2 to 64, feature 3 to 16, and feature 4 to 16.
I run into an error saying:
Assign requires shapes of both tensors to match. lhs shape=[928,1024] rhs shape=[880,1024]
I am confused as to how these tensor sizes are calculated. How do I set them correctly? Kindly reply.
"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""
import tensorflow as tf
import opennmt as onmt
def model():
return onmt.models.SequenceToSequence(
source_inputter=onmt.inputters.ParallelInputter([
onmt.inputters.WordEmbedder(
vocabulary_file_key="source_words_vocabulary",
embedding_size=512),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_1_vocabulary",
embedding_size=64),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_2_vocabulary",
embedding_size=64),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_3_vocabulary",
embedding_size=16),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_4_vocabulary",
embedding_size=16)],
reducer=onmt.layers.ConcatReducer()),
target_inputter=onmt.inputters.WordEmbedder(
vocabulary_file_key="target_words_vocabulary",
embedding_size=512),
encoder=onmt.encoders.BidirectionalRNNEncoder(
num_layers=4,
num_units=512,
reducer=onmt.layers.ConcatReducer(),
cell_class=tf.contrib.rnn.LSTMCell,
dropout=0.3,
residual_connections=False),
decoder=onmt.decoders.AttentionalRNNDecoder(
num_layers=4,
num_units=512,
bridge=onmt.layers.CopyBridge(),
attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
cell_class=tf.contrib.rnn.LSTMCell,
dropout=0.3,
residual_connections=False))
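For what it's worth, here is one way those exact numbers can arise. This is a sketch under two assumptions: that the bidirectional encoder splits num_units across the two directions (512 // 2 = 256 units per direction, so a first-layer LSTM kernel has shape [input_depth + 256, 4 * 256]), and that the lhs shape comes from the current graph while the rhs comes from an older checkpoint in the same model directory, i.e. the feature embedding sizes were changed without starting training from a fresh model_dir.

```python
# Sketch: reproduce the mismatched Assign shapes from the error message.
# Assumption: each LSTM direction has 512 // 2 = 256 units (ConcatReducer
# over two directions), and an LSTM kernel is [input_depth + units, 4*units].
UNITS_PER_DIRECTION = 512 // 2  # 256


def lstm_kernel_shape(embedding_sizes, units=UNITS_PER_DIRECTION):
    """First-layer LSTM kernel shape given concatenated input embeddings."""
    input_depth = sum(embedding_sizes)  # ConcatReducer sums embedding sizes
    return [input_depth + units, 4 * units]


# Current model: 512 + 64 + 64 + 16 + 16 = 672 input depth.
print(lstm_kernel_shape([512, 64, 64, 16, 16]))  # [928, 1024] (graph / lhs)

# Earlier model with feature_2 at 16: 512 + 64 + 16 + 16 + 16 = 624.
print(lstm_kernel_shape([512, 64, 16, 16, 16]))  # [880, 1024] (checkpoint / rhs)
```

If that reading is right, any combination of embedding sizes is valid; the error just means the checkpoint and the current model definition disagree, and retraining in an empty model directory should clear it.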
Hello… I am still unable to figure out the proper way to set the embedding sizes. I have tried multiple combinations, and every one resulted in an error involving the tensor shapes. Could you please help me out…
@sanadi1209 @guillaumekln
Can you explain how you generated these files?
Does every word in this tagged file need to have a separate tag indicating its gender (or not), number (or not), etc.? (A snapshot or example would be great.)
Hi… Sorry for the late response. I have been using the TDIL corpus. Some segments of the corpus have already been tagged with POS tags and other information. I just parsed these tags and wrote the word sequences to one file and the corresponding POS tag sequences to another. I did the same for the gender tags (m-f-n for male, female and neutral). This is the method I followed in the question here.
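The parsing step can be sketched roughly as below. The "word|POS|gender" token format here is purely illustrative, not the actual TDIL markup, and the sample tokens are made up:

```python
# Hypothetical sketch: split a tagged sentence into parallel word / POS /
# gender sequences, one output line per feature file.

def split_tagged_line(line):
    """Turn one tagged sentence into (words, pos_tags, genders) strings."""
    words, pos_tags, genders = [], [], []
    for token in line.split():
        # Assumed format: word|POS|gender for every token.
        word, pos, gender = token.split("|")
        words.append(word)
        pos_tags.append(pos)
        genders.append(gender)
    return " ".join(words), " ".join(pos_tags), " ".join(genders)


line = "raam|NNP|m ghar|NN|n gayaa|VM|m"
print(split_tagged_line(line))
# ('raam ghar gayaa', 'NNP NN VM', 'm n m')
```

Each of the three returned strings would then be appended to its own file, keeping the files line-aligned as the ParallelInputter requires.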