OpenNMT Forum

Question regarding Parallel Inputter - problem with evaluation

Hi… I am working with a model that takes two parallel inputs, a word and its tag, in two separate files. I am using the onmt ParallelInputter architecture for this purpose.

  • The word vocabulary size has been set to 100,000 using build-vocab.

  • The tag vocabulary size is 26.

  • Training batch size is 24 and evaluation batch size is 32.

Here is my config file:

model_dir: /content/gdrive/My Drive/opennmtmulti
data:
  train_features_file: 
    - '/content/gdrive/My Drive/opennmtmulti/kantrainsentences.txt'
    - '/content/gdrive/My Drive/opennmtmulti/kantraintagset.txt'
  train_labels_file: '/content/gdrive/My Drive/opennmtmulti/teltrain.txt'
  eval_features_file: 
    - '/content/gdrive/My Drive/opennmtmulti/kanvalsentences.txt'
    - '/content/gdrive/My Drive/opennmtmulti/kanvaltagset.txt'
  eval_labels_file: '/content/gdrive/My Drive/opennmtmulti/telval.txt'
  source_1_vocabulary: '/content/gdrive/My Drive/opennmtmulti/kanvocab2.txt'
  source_2_vocabulary: '/content/gdrive/My Drive/opennmtmulti/tagvocab.txt'
  target_vocabulary: '/content/gdrive/My Drive/opennmtmulti/telvocab2.txt'
params:
  optimizer: Adam
  learning_rate: 0.001
train:
  sample_buffer_size: 50000
  batch_size: 24
  train_steps: 500000

And the following is my model description file:

import tensorflow as tf
import opennmt as onmt
import tensorflow_addons as tfa

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.ParallelInputter(
          [
              onmt.inputters.WordEmbedder(embedding_size=512),
              onmt.inputters.WordEmbedder(embedding_size=16),
          ],
          reducer=onmt.layers.ConcatReducer()),
      target_inputter=onmt.inputters.WordEmbedder(embedding_size=512),
      encoder=onmt.encoders.SelfAttentionEncoder(
          num_layers=4,
          num_units=512,
          dropout=0.3),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=4,
          num_units=512,
          attention_mechanism_class=tfa.seq2seq.LuongAttention,
          dropout=0.3))

My problem is that training runs without errors for the first 5000 steps. After the checkpoint is saved and evaluation on the validation set begins, I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [32,26,512] vs. shape[1] = [32,23,16]
[[node sequence_to_sequence_1/parallel_inputter_1/concat (defined at /lib/python3.6/dist-packages/opennmt/layers/reducer.py:156) ]]
[[sequence_to_sequence_1/while/LoopCond/_197/_164]]
(1) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [32,26,512] vs. shape[1] = [32,23,16]
[[node sequence_to_sequence_1/parallel_inputter_1/concat (defined at /lib/python3.6/dist-packages/opennmt/layers/reducer.py:156) ]]

I gather that the embedding size is not what the program expects here, but I am not sure where to change it. Can different features have different embedding sizes in ParallelInputter? Where am I going wrong? And what does the 23 in the shape[1] tensor refer to? Please help me.

26 and 23 refer to the sequence lengths, so it seems that at least one line in your files has more words than tags. You should check that your word and tag files are correctly aligned.
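In case it helps, one quick way to find misaligned lines is to compare per-line token counts between the two files. This is only a hypothetical sketch (the function name and the in-memory example lines are mine); in practice you would read your word and tag files and pass their lines in:

```python
def find_misaligned(word_lines, tag_lines):
    """Return the 1-based line numbers where word and tag counts differ."""
    bad = []
    for i, (words, tags) in enumerate(zip(word_lines, tag_lines), start=1):
        if len(words.split()) != len(tags.split()):
            bad.append(i)
    return bad

# Toy example; in practice use the lines of your sentence and tag files.
words = ["the cat sat", "hello world"]
tags = ["DT NN VB", "UH"]  # second line is missing a tag
print(find_misaligned(words, tags))  # [2]
```

Note that this only catches count mismatches; if the two files have a different number of lines entirely, `zip` will silently stop at the shorter one, so it is worth comparing line counts first.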

I will check. How about the discrepancy between 512 and 16? Where do I need to take care of this?

That’s not an issue because this is the dimension where the embeddings are concatenated. The output dimension will be 512+16=528.
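For illustration, here is a NumPy sketch of what the concatenation does along the depth dimension, using the batch and embedding sizes from the error message but with matching sequence lengths (as they would be when the files are aligned):

```python
import numpy as np

# Word embeddings: (batch, sequence_length, embedding_size)
words = np.zeros((32, 26, 512))
# Tag embeddings: same batch and sequence length, smaller embedding size
tags = np.zeros((32, 26, 16))

# ConcatReducer joins along the last (depth) axis, so only the first two
# dimensions must match; the depth dimensions simply add up: 512 + 16 = 528.
merged = np.concatenate([words, tags], axis=-1)
print(merged.shape)  # (32, 26, 528)
```

This is why the 512 vs. 16 difference is fine, while the 26 vs. 23 sequence-length difference triggers the ConcatOp error.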

Thank you for the prompt help. I think I fixed the problem.