I know the word embedding has an out-of-vocabulary (OOV) parameter to handle unknown words.
But the ARK file, I guess, is like a word embedding: m vectors, each with depth h. What will the TFRecords do at prediction time when the system receives an (m+1)-th vector (a key outside the embedding)?
Thanks for the quick answer. I still don't fully get it; I find the documentation missing some info needed to fully understand this.
What does "align vector" mean?
So if I pass a text (a sentence) and an ARK file representation, how is this going to be fed to SequenceRecordInputter?
Ahh, that was the confusion.
I didn't know the ARK file is the source file and the txt is the target output (text).
But looking at opennmt.bin.ark_to_records, --txt is not a required parameter.
I see. The only thing I don't get is what the keys inside the ARK file are for.
Because if the target file is the txt, what is the conversion from ARK to TFRecords doing? I guess it takes the sequence of text from the target, goes to the ARK file for each word and looks up its key, then creates a binary as a batch with timesteps and the depth for each sentence?
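To check my understanding of the first step (reading the ARK file), here is a minimal sketch of a parser. It is not the actual opennmt.bin.ark_to_records code; it just assumes the simple Kaldi-style text layout shown later in my example, where each entry is a key followed by a matrix in brackets.

```python
def parse_text_ark(lines):
    """Parse a Kaldi-style text ARK into {key: list of frame vectors}.

    Handles both one-line entries ("KEY [ v1 v2 ... ]") and entries
    whose matrix spans several lines, one frame per line.
    """
    entries = {}
    key = None
    frames = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "[" in line:
            # Start of a new entry: the key precedes the opening bracket.
            head, _, rest = line.partition("[")
            key = head.strip()
            frames = []
            line = rest
        closing = line.endswith("]")
        if closing:
            line = line[:-1]
        values = [float(v) for v in line.split()]
        if values:
            frames.append(values)
        if closing and key is not None:
            entries[key] = frames
            key = None
    return entries


# The two entries from my example below.
ark = parse_text_ark([
    "A [ 0.1 0.2 0.3 ]",
    "C [ 4.0 0.0 0.1 ]",
])
```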
Thank you so much; I know I'm asking a lot, but it's quite confusing to me.
I see, so the ARK file keys are used to create the alignments, each targeting one sentence (or sample).
So I guess it's like an indexed file:
the ARK file is like a database, and the target txt file is a list of keys matching entries in the ARK file, used to create an ordered binary. Each line of the target file used to create the TFRecord will match the index of the sentence (example) inside the WordEmbedder.
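That "database + list of keys" idea, sketched in a few lines (the variable names are mine, and the contents mirror my toy example):

```python
# The ARK contents as a dict: key -> matrix of frame vectors.
ark = {
    "A": [[0.1, 0.2, 0.3]],
    "C": [[4.0, 0.0, 0.1]],
}

# The target txt read as an ordered list of keys, one per line.
target_keys = ["A", "C"]

# Emit the matrices in target order; a KeyError here would mean a key
# in the txt file has no entry in the ARK "database".
ordered = [ark[k] for k in target_keys]
```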
Steps to train the model:
– ARK conversion to TFRecords:
ark file, for example:
first key ...
... some other keys ...
A [ 0.1 0.2 0.3 ]
C [ 4.0 0.0 0.1 ]
... more keys
target txt for the ARK conversion to TFRecord:
example 1: A
example 2: C
– Training samples:
input English source txt file:
example 1 my name is vanesa
example 2 I love so much sing a song
output Spanish target txt file:
example 1 mi nombre es vanesa
example 2 me encanta cantar una cancion
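The batching step I described above ("a batch with timesteps and the depth for each sentence") could be sketched like this. The numbers are made up for illustration (I pretend entry C has 2 timesteps so the padding is visible); depth here is 3.

```python
# Per-example [timesteps x depth] matrices looked up from the ARK file.
examples = [
    [[0.1, 0.2, 0.3]],                    # "A": 1 timestep
    [[4.0, 0.0, 0.1], [0.5, 0.5, 0.5]],  # "C": pretend 2 timesteps
]
depth = 3

# Pad every matrix to the longest sequence so the batch becomes a dense
# [batch, max_timesteps, depth] array.
max_len = max(len(m) for m in examples)
batch = [m + [[0.0] * depth] * (max_len - len(m)) for m in examples]

# Keep the true lengths so the model can mask out the padding frames.
lengths = [len(m) for m in examples]
```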
Using this code would be like having 2 sources.
So I guess the vector depth from the ARK file would have to match the word embedding depth?
What I don't get is how to match the timesteps of each ARK vector to each word.
So if I had to use these TFRecords for predicting, I guess at serving time I would have to encode each sentence into a TFRecord and send it as a parallel input, right?
And those word vectors have depth 2, right?