Character seq2seq - any example / tutorial?

Hello!

I’m running the included example (toy_ende) without problems, but I can’t manage to adapt it to character sequences (for both source and target). So I was wondering: is there a tutorial on running the most basic seq2seq with character sequences on both the encoder and decoder sides?

Here is my current config:

import tensorflow as tf
import opennmt as onmt

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.CharConvEmbedder(
          vocabulary_file_key="source_chars_vocabulary",
          embedding_size=100,
          num_outputs=100,
          kernel_size=3,
          stride=1,
          dropout=0.5,
          tokenizer=onmt.tokenizers.CharacterTokenizer()),
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_chars_vocabulary",
          embedding_size=100,
          tokenizer=onmt.tokenizers.CharacterTokenizer()),
      encoder=onmt.encoders.UnidirectionalRNNEncoder(
          num_layers=2,
          num_units=150,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=2,
          num_units=150,
          bridge=onmt.utils.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False))

And the Yaml:

    data:
      train_features_file: data/transliteration_v1/src-train.txt
      train_labels_file: data/transliteration_v1/tgt-train.txt
      eval_features_file: data/transliteration_v1/src-val.txt
      eval_labels_file: data/transliteration_v1/tgt-val.txt
      source_chars_vocabulary: data/transliteration_v1/src-vocab.txt
      target_chars_vocabulary: data/transliteration_v1/tgt-vocab.txt

The error I’m getting:

    tensorflow.python.framework.errors_impl.InvalidArgumentError: TensorArray has inconsistent shapes.  Index 0 has shape: [1] but index 5 has shape: [0]
    	 [[Node: map_2/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@map_2/TensorArray_1"], dtype=DT_STRING, element_shape=[?]](map_2/TensorArray_1, map_2/TensorArrayStack/range, map_2/while/Exit_1)]]
    	 [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,?,?], [?], [?,?], [?,?], [?]], output_types=[DT_INT64, DT_INT32, DT_INT64, DT_INT64, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator)]]

Hello,

For a basic character seq2seq model, you should just set the character tokenizer on a standard WordEmbedder, for example:

import tensorflow as tf
import opennmt as onmt

def model():
  return onmt.models.SequenceToSequence(
      source_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="source_chars_vocabulary",
          embedding_size=30,
          tokenizer=onmt.tokenizers.CharacterTokenizer()),
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_chars_vocabulary",
          embedding_size=30,
          tokenizer=onmt.tokenizers.CharacterTokenizer()),
      encoder=onmt.encoders.UnidirectionalRNNEncoder(
          num_layers=2,
          num_units=512,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False),
      decoder=onmt.decoders.AttentionalRNNDecoder(
          num_layers=2,
          num_units=512,
          bridge=onmt.utils.CopyBridge(),
          attention_mechanism_class=tf.contrib.seq2seq.LuongAttention,
          cell_class=tf.contrib.rnn.LSTMCell,
          dropout=0.3,
          residual_connections=False))

Also, make sure that you generated your vocabulary with the --tokenizer CharacterTokenizer option.
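For reference, with the paths from the config above, the vocabulary generation could look like this (a sketch using the OpenNMT-tf onmt-build-vocab script; check the flags against your installed version):

```shell
# Build a character vocabulary from the training source file.
onmt-build-vocab --tokenizer CharacterTokenizer \
    --save_vocab data/transliteration_v1/src-vocab.txt \
    data/transliteration_v1/src-train.txt

# Same for the target side.
onmt-build-vocab --tokenizer CharacterTokenizer \
    --save_vocab data/transliteration_v1/tgt-vocab.txt \
    data/transliteration_v1/tgt-train.txt
```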

Thanks Guillaume, that was really helpful!

With a target inputter using a CharacterTokenizer, the output is a sequence of characters separated by spaces. Is there any way (a config option?) to get the reconstructed output without the spaces? I can of course write a script; I was just wondering if I was missing a configuration option. :)

I’m working on adding the detokenization logic. In the meantime, as this detokenization is trivial, it should be easy to write a small script to restore the sentence.
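Such a script only needs to drop the separating spaces (a minimal sketch; the function name is illustrative):

```python
def char_detokenize(line):
    """Rejoin a space-separated character sequence, e.g. "h e l l o" -> "hello"."""
    return "".join(line.split(" "))

print(char_detokenize("h e l l o"))  # hello
```

Note that this cannot distinguish token separators from spaces that were in the original text, which is why a placeholder character is needed.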

We should also replace existing spaces with a special character to avoid confusion.

The two features mentioned in my previous post are now on master (retraining is required when using the CharacterTokenizer). Inference and evaluation predictions are now automatically detokenized.
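To illustrate the round trip, here is a sketch of a character tokenizer/detokenizer pair using an assumed space placeholder (`▁` here; the actual marker OpenNMT-tf uses may differ):

```python
PLACEHOLDER = "▁"  # assumed marker for real spaces; illustrative only

def char_tokenize(text):
    # One token per character; real spaces become the placeholder token.
    return " ".join(PLACEHOLDER if c == " " else c for c in text)

def char_detokenize(tokens):
    # Invert: drop the token separators, then restore real spaces.
    return tokens.replace(" ", "").replace(PLACEHOLDER, " ")

print(char_tokenize("ab cd"))                   # a b ▁ c d
print(char_detokenize(char_tokenize("ab cd")))  # ab cd
```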