What does SequenceRecordInputter do with unknown keys in TensorFlow?

tensorflow

(Lockder) #1

Well, the title already says it.

I know the word embedding has an out-of-vocabulary (OOV) bucket parameter to handle unknown words.
But I guess the ARK file is like a word embedding: m vectors of depth h. What will the TFRecords do at prediction time if the system receives an (m + 1)-th vector (a key outside the embedding)?


(Guillaume Klein) #2

The SequenceRecordInputter has no keys, only the actual vector that will be fed into the encoder.

Keys are only used in the ARK representation to align vector and text. Once a match is found, the key is no longer useful.


(Lockder) #3

Thanks for the quick answer. I don’t fully get it; I find the documentation is missing some info needed to fully understand it.
What does “align vector” mean?
So if I pass a text (a sentence) and an ARK file representation, how is this going to be fed to the SequenceRecordInputter?


(Guillaume Klein) #4

Say, you have this source ARK file that defines 3 vectors, of 1 timestep and 3 dimensions:

A [ 0.1 0.2 0.3 ]
B [ 2.0 1.2 -0.3 ]
C [ 4.0 0.0 0.1 ]

and this target text file:

C some sentence
A another sentence 

The script onmt-ark-to-records can be used to produce the source TFRecord file containing:

[ 0.1 0.2 0.3 ]
[ 4.0 0.0 0.1 ]

(this file will actually be in binary format)

and the target text file:

another sentence
some sentence

that can be used for training. The SequenceRecordInputter will read vectors from the TFRecord file and feed them to the encoder (like the word embeddings).
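The key matching described above can be sketched in plain Python. This is only an illustration of the alignment idea, not the actual onmt-ark-to-records implementation: the function name `align_ark_with_text` and the in-memory lists are assumptions made for the example, and the real script reads files and writes a binary TFRecord.

```python
def align_ark_with_text(ark_vectors, text_lines):
    """Pair each keyed vector with the text line carrying the same key.

    ark_vectors: list of (key, vector) pairs, in ARK file order.
    text_lines: lines of the form "<key> <sentence>".
    Returns (vectors, sentences) in ARK order; entries whose key has
    no matching text line are dropped.
    """
    text_by_key = {}
    for line in text_lines:
        key, _, sentence = line.strip().partition(" ")
        text_by_key[key] = sentence

    vectors, sentences = [], []
    for key, vector in ark_vectors:
        if key in text_by_key:
            vectors.append(vector)
            sentences.append(text_by_key[key])
    return vectors, sentences


# Same data as in the example above: 3 vectors of 1 timestep and 3 dimensions.
ark = [
    ("A", [[0.1, 0.2, 0.3]]),
    ("B", [[2.0, 1.2, -0.3]]),
    ("C", [[4.0, 0.0, 0.1]]),
]
text = ["C some sentence", "A another sentence"]

vectors, sentences = align_ark_with_text(ark, text)
print(vectors)    # the vectors for A and C, without their keys
print(sentences)  # ['another sentence', 'some sentence']
```

Note that the keys are gone from both outputs: once the pairs are aligned line by line, nothing downstream needs them.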

Is that clearer to you? If yes, I will update the documentation with a similar explanation.


(Lockder) #5

Almost. Here is what I understand from your explanation:

and this text file:

C some sentence
A another sentence 

Do the vectors C and A each represent a full sentence?
Is it like the full encoding, using all the words of the sentence to output a tensor that is then used for the final encoder?


(Guillaume Klein) #6

Let me clarify that the TFRecord file is the source and the text file is the target. So the training will try to translate the source vector into the target text.

The vectors can represent anything, depending on the task. For example:

  • actual sequence of word embeddings (equivalent to a text to text training with fixed embeddings, but the source input will be a vector instead of text)
  • audio data (for speech to text, see the model ListenAttendSpell)

(Lockder) #7

Aahhh, that was the confusion.
I didn’t know that the ARK file is the source file and the txt is the target output (text).
But looking at opennmt.bin.ark_to_records, --txt is not a required parameter.


(Lockder) #8

Could those TFRecords be used as features?


(Guillaume Klein) #9

What do you mean by feature here? An additional vector attached to each word in a sentence?


(Lockder) #10

A vector for the full sentence after all the words are embedded. Or yes, another version would be an additional vector attached to each word in the sentence.


(Guillaume Klein) #11

Currently it is only possible to do that at the word level using a parallel inputter:

import opennmt as onmt

inputters = [
    onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_vocabulary_1", embedding_size=512),
    onmt.inputters.SequenceRecordInputter()]

# Concatenate the word embedding and the record vector at each timestep.
inputter = onmt.inputters.ParallelInputter(inputters, reducer=onmt.layers.ConcatReducer())

(Lockder) #12

To do that, I guess source_vocabulary_1 would have to match the keys inside the ARK file?


(Guillaume Klein) #13

No, this is unrelated to the ARK keys.

http://opennmt.net/OpenNMT-tf/package/opennmt.inputters.text_inputter.html#opennmt.inputters.text_inputter.WordEmbedder

vocabulary_file_key – The data configuration key of the vocabulary file containing one word per line.

So basically your configuration will look like this:

data:
  train_features_file:
    - source_text.txt                              # WordEmbedder input data.
    - vector_features.record                       # SequenceRecordInputter input data.
  source_vocabulary_1: source_text_vocabulary.txt  # For WordEmbedder vocabulary.

(Lockder) #14

I see. The only thing I don’t get is what the keys inside the ARK file are for.
If the target file is the txt, what is the conversion from ARK to TFRecords doing? I guess it goes through the sequence of text from the target, looks up the key in the ARK file for each word, then creates a binary file using a batch with the timesteps and the depth for each sentence?


(Guillaume Klein) #15

The ARK file format comes from Kaldi.

The keys are only used to associate one source vector with one target sentence. After running the ARK to records script, source vectors and target texts will be correctly aligned.


(Lockder) #16

Thank you so much. I know I’m asking a lot, but it’s quite confusing to me.

I see, so the ARK file keys are used to create the alignments, each targeting one sentence (or sample).
So I guess it’s like an indexed file,
where the ARK file is like a database and the target txt file is a list of keys matching the ARK file, used to create an ordered binary. Each line of the target file used to create the TFRecord will match the index of the sentence (example) inside the WordEmbedder.

Steps to train the model:

– ARK conversion to TFRecords:

ark file for example:

first key ...
... some other keys ...
A [ 0.1 0.2 0.3 ]
C [ 4.0 0.0 0.1 ]
... more keys
end file

target txt for ark conversion to TFRecord:

example 1: A
example 2: C

– Training samples:

  • input English source txt file:
example 1 my name is vanesa
example 2 I love so much sing a song
  • output Spanish target txt file:
example 1 mi nombre es vanesa
example 2 me encanta cantar una cancion

Using this code would be like having 2 sources.

So I guess the vector depth from the ARK file would have to match the word embedding depth?
What I don’t get is how to match the timesteps from each ark vector to each word

inputters = [
    onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_vocabulary_1", embedding_size=512),
    onmt.inputters.SequenceRecordInputter()]

inputter = onmt.inputters.ParallelInputter(inputters, reducer=onmt.layers.ConcatReducer())


(Guillaume Klein) #17

It depends on how you are merging them: for a sum the depths have to be the same, for a concatenation they can be different.

What I don’t get is how to match the timesteps from each ark vector to each word

If you are doing parallel inputs, you should ensure that when preparing the input files.
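To make the two constraints concrete, here is a small plain-Python sketch of what merging two parallel inputs implies for their shapes. It is not how OpenNMT-tf implements its reducers; `merge_inputs` is a hypothetical helper that works on nested lists of shape [time x depth], written only to show that both inputs need the same number of timesteps and that only a sum needs the same depth.

```python
def merge_inputs(word_embeddings, record_vectors, mode="concat"):
    """Merge two [time x depth] sequences timestep by timestep."""
    if len(word_embeddings) != len(record_vectors):
        raise ValueError("both inputs need the same number of timesteps")

    merged = []
    for word_vec, record_vec in zip(word_embeddings, record_vectors):
        if mode == "concat":
            # Depths may differ: the result has depth d1 + d2.
            merged.append(list(word_vec) + list(record_vec))
        elif mode == "sum":
            # Depths must be equal: the result keeps that depth.
            if len(word_vec) != len(record_vec):
                raise ValueError("a sum needs the same depth on both inputs")
            merged.append([a + b for a, b in zip(word_vec, record_vec)])
        else:
            raise ValueError("unknown mode: " + mode)
    return merged


# A 4-word sentence: 4 timesteps of 512-dim embeddings plus 4 timesteps
# of 2-dim record vectors (values are placeholders).
words = [[0.0] * 512 for _ in range(4)]
features = [[1.0, 1.0] for _ in range(4)]

merged = merge_inputs(words, features, mode="concat")
print(len(merged), len(merged[0]))  # 4 514
```

Calling the same helper with `mode="sum"` on these inputs raises an error, since 512 and 2 do not match.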


(Lockder) #18

OK, so from what I see, the TFRecords don’t understand batches.
It’s just timesteps in order.
It’s like recording a movie: each frame is just one picture.
So if I wanted to match each word, for example for

example 1 my name is vanesa

I would need to have a TFRecord target file where the vectors from the ARK txt file match each word by position

A -> my
B -> name
C -> is
D -> vanesa

These would have to match the order of the words inside the input txt source file.

so for the second example:

example 2 I love so much sing

E -> I
F -> love
G -> so
H -> much
I -> sing

So the input txt source file would be:

  • example 1 sentence
  • example 2 sentence

but the ARK txt target file should be:

  • A
  • B
  • C
  • D
  • E
  • F
  • G
  • H
  • I

Then, since the TFRecords file is ordered, will it get one row for each new vector?


(Guillaume Klein) #19

No, 1 word = 1 timestep.

So if you have this indexed text file:

A my name is vanesa
B I love so much sing

then your ARK file should look like this:

A [
1.0 1.0
1.0 1.0
1.0 1.0
1.0 1.0]
B [
2.0 2.0
2.0 2.0
2.0 2.0
2.0 2.0
2.0 2.0]

Note that A has 4 timesteps and B has 5.
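Since the number of rows per key has to equal the number of words on the matching text line, a quick check before converting can catch mistakes early. The sketch below is plain Python with a hypothetical helper name (`check_timesteps`); it assumes the ARK matrices have already been loaded into a dict of nested lists and does not parse the ARK format itself.

```python
def check_timesteps(indexed_lines, ark_matrices):
    """Return (key, word_count, timesteps) for every mismatching entry."""
    mismatches = []
    for line in indexed_lines:
        key, *words = line.split()
        timesteps = len(ark_matrices[key])  # one row per timestep
        if timesteps != len(words):
            mismatches.append((key, len(words), timesteps))
    return mismatches


indexed_text = ["A my name is vanesa", "B I love so much sing"]
ark = {
    "A": [[1.0, 1.0]] * 4,  # 4 timesteps of depth 2
    "B": [[2.0, 2.0]] * 5,  # 5 timesteps of depth 2
}

mismatches = check_timesteps(indexed_text, ark)
print(mismatches)  # []
```

An empty list means every sentence has exactly one vector row per word.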


Alternatively, you can manually create the record file to not bother about indices or the ARK representation. To write 1 record, it just takes a 2D vector of shape [time x dimension]:

http://opennmt.net/OpenNMT-tf/package/opennmt.inputters.record_inputter.html#opennmt.inputters.record_inputter.write_sequence_record


(Lockder) #20

So if I had to use these TFRecords for predicting, I guess at serving time I would have to encode each sentence into a TFRecord and send it as a parallel input, right?
And those words have a depth of 2, right?