I know the word embedding has an out-of-vocabulary (OOV) parameter to handle unknown words.
But I guess the ARK file is like a word embedding: m vectors with depth h. What will the TFRecord do at prediction time, when the system receives an m+1-th vector (a key outside the embedding)?
Thanks for the quick answer. I don’t fully get it; I find the documentation is missing some info needed to fully understand it.
What does “align vector” mean?
So say I pass a text (a sentence) and an ARK file representation. How is this going to be fed to SequenceRecordInputter?
Say you have this source ARK file that defines 3 vectors of 1 timestep and 3 dimensions:
A [ 0.1 0.2 0.3 ]
B [ 2.0 1.2 -0.3 ]
C [ 4.0 0.0 0.1 ]
and this target text file:
C some sentence
A another sentence
The script onmt-ark-to-records can be used to produce the source TFRecord file containing:
[ 0.1 0.2 0.3 ]
[ 4.0 0.0 0.1 ]
(this file will actually be in binary format)
and the target text file:
another sentence
some sentence
that can be used for the training. The SequenceRecordInputter will read vectors from the TFRecord file and feed them to the encoder (like word embeddings).
Is that clearer to you? If yes, I will update the documentation with a similar explanation.
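What the conversion script does can be sketched in pure Python (the function names here are made up for illustration; the real script writes a binary TFRecord file rather than returning lists). Note that it iterates in ARK order, which is why A’s record comes out before C’s in the example above:

```python
def parse_ark(lines):
    """Parse text ARK lines like 'A [ 0.1 0.2 0.3 ]' into an ordered {key: vector} dict."""
    vectors = {}
    for line in lines:
        key, rest = line.split(" ", 1)
        vectors[key] = [float(v) for v in rest.strip().strip("[]").split()]
    return vectors

def align(ark_lines, target_lines):
    """Pair each source vector with its keyed target sentence, dropping keys with no target."""
    vectors = parse_ark(ark_lines)
    targets = dict(line.split(" ", 1) for line in target_lines)
    return [(vec, targets[key]) for key, vec in vectors.items() if key in targets]

ark = ["A [ 0.1 0.2 0.3 ]", "B [ 2.0 1.2 -0.3 ]", "C [ 4.0 0.0 0.1 ]"]
target = ["C some sentence", "A another sentence"]
pairs = align(ark, target)
# B has no target sentence, so only 2 aligned pairs remain
```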
and this text file:
C some sentence
A another sentence
Do the vectors C and A represent a full sentence representation?
Is it like a full encoding that uses all the words of the sentence to output a tensor, which is then used by the final encoder?
Let me clarify that the TFRecord file is the source and the text file is the target. So the training will try to translate the source vector into the target text.
The vectors can represent anything; it depends on the task. For example:
an actual sequence of word embeddings (equivalent to a text-to-text training with fixed embeddings, but the source input will be vectors instead of text)
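That first case (a sequence of fixed word embeddings) could be produced offline with something like this sketch; the embedding table, its depth, and the `<unk>` fallback are made-up assumptions, not part of any library:

```python
# Fixed (pretrained) embedding table of depth 3 -- values made up.
embeddings = {
    "my": [0.1, 0.0, 0.2],
    "name": [0.5, 0.3, 0.1],
    "<unk>": [0.0, 0.0, 0.0],  # fallback vector for out-of-vocabulary words
}

def embed_sentence(sentence):
    """Turn a sentence into a [time x depth] list of fixed embedding vectors."""
    return [embeddings.get(word, embeddings["<unk>"]) for word in sentence.split()]

vectors = embed_sentence("my name is")
# 3 timesteps of depth 3; "is" is not in the table, so it falls back to <unk>
```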
Ahh, that was the confusion.
I didn’t know the ARK file is the source file and the txt file is the target output (text).
But looking at opennmt.bin.ark_to_records, --txt is not a required parameter.
a vector for the full sentence after all the words are embedded; or, in another version, an additional vector attached to each word in the sentence
I see. The only thing I don’t get is what the keys inside the ARK file are for.
Because if the target file is the txt, what is the conversion from ARK to TFRecords doing? I guess it looks at the sequence of text from the target, goes to the ARK file for each word to look up the key, then creates a binary using a batch with timesteps and the depth for each sentence?
The keys are only used to associate one source vector with one target sentence. After running the ARK-to-records script, source vectors and target texts will be correctly aligned.
Thank you so much; I know I’m asking a lot, but it’s quite confusing to me.
I see, so the ARK file keys are used to create the alignments, each targeting one sentence (or sample).
So I guess it’s like an indexed file.
The ARK file is like a database, and the target txt file is a list of keys matching entries in the ARK file, used to create an ordered binary. Each line of the target file used to create the TFRecord will match the index of the sentence (example) inside the WordEmbedder.
Steps to train the model:
– ARK conversion to TFRecords:
ARK file, for example:
first key ...
... some other keys ...
A [ 0.1 0.2 0.3 ]
C [ 4.0 0.0 0.1 ]
... more keys
end of file
target txt for the ARK-to-TFRecord conversion:
example 1: A
example 2: C
– Training samples:
input english source txt file:
example 1 my name is vanesa
example 2 I love so much sing a song
output spanish target txt file:
example 1 mi nombre es vanesa
example 2 me encanta cantar una cancion
Using this setup would be like having 2 sources.
So I guess the vector depth from the ARK file would have to match the word embedding depth?
What I don’t get is how to match the timesteps from each ARK vector to each word.
OK, so from what I see, the TFRecord file doesn’t understand batches.
It’s just timesteps in order.
It’s like recording a movie: each frame is just one picture.
So if I wanted to match each word, for example for
example 1 my name is vanesa
I would need to have a TFRecord target file where the vectors from the ARK txt file match each word by position:
A -> my
B -> name
C -> is
D -> vanesa
The vectors would have to match the order of the words inside the source input txt file.
So for the second example:
example 2 I love so much sing
E -> I
F -> love
G -> so
H -> much
I -> sing
so the source input txt file would be:
example 1 sentence
example 2 sentence
but the ARK txt target file should be:
A
B
C
D
E
F
G
H
I
Then, since the TFRecord file is ordered, will it read one row for each new vector?
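The per-word scheme described above can be sketched like this (pure Python; the key-to-word mapping follows the example, while the vector values and the stacking helper are my own assumptions of what such a conversion would do):

```python
# One vector per word key, depth 3 -- values are made up.
ark = {
    "A": [0.1, 0.2, 0.3],   # my
    "B": [2.0, 1.2, -0.3],  # name
    "C": [4.0, 0.0, 0.1],   # is
    "D": [0.5, 0.5, 0.5],   # vanesa
}

def sentence_record(keys):
    """Stack the per-word vectors, in word order, into one [time x depth] record."""
    return [ark[key] for key in keys]

record = sentence_record(["A", "B", "C", "D"])
# 4 timesteps (one per word of "my name is vanesa"), each of depth 3
```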
Alternatively, you can manually create the record file so you don’t have to bother about indices or the ARK representation. To write 1 record, it just takes a 2D vector of shape [time x dimension]:
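As a sketch of that shape requirement, assuming NumPy is available (actually serializing the record into the TFRecord file would then go through TensorFlow, which is not shown here):

```python
import numpy as np

def make_record(vectors):
    """Build one record from a list of per-timestep vectors."""
    record = np.asarray(vectors, dtype=np.float32)
    assert record.ndim == 2  # shape must be [time x dimension]
    return record

record = make_record([[0.1, 0.2, 0.3], [4.0, 0.0, 0.1]])
# record.shape is (2, 3): 2 timesteps, dimension 3
```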
So if I had to use these TFRecords for prediction, I guess at serving time I would have to encode each sentence into a TFRecord and send it as a parallel input, right?
And those words have depth 2, right?