What type of representation is used by OpenNMT?


(Negacy Hailu) #1

How does OpenNMT represent the words (or characters) of the input sequence? Does it use one-hot encoding, or does it compute embeddings internally? Assuming I have French-to-English parallel sentences, I would like to know how the tool represents the words of the input (the French sentences) in this example.


(Guillaume Klein) #2

The training optimizes an embedding for each word, as you’d expect.

Does that answer your question?


(Negacy Hailu) #3

Are you saying it learns word embeddings automatically? What if the translation is character-based? Does it learn character embeddings? The tool allows loading pre-trained character embeddings for both the source and target languages, and I was wondering how the characters are represented when no pre-trained character embeddings are provided.


(Guillaume Klein) #4

I should have used “for each token” in my sentence. A token can be a word, a subword, or a character depending on the tokenization.
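As a rough illustration of how the tokenization changes what counts as a token, here is a minimal sketch in plain Python. The helpers `tokenize` and `build_vocab` are hypothetical names made up for this example and are not OpenNMT functions; real tokenizers (including subword ones) are more involved.

```python
# Illustrative only: `tokenize` and `build_vocab` are hypothetical helpers,
# not part of OpenNMT.

def tokenize(sentence, level):
    """Split a sentence into word-level or character-level tokens."""
    if level == "word":
        return sentence.split()
    if level == "char":
        # Replace spaces with "_" so that they survive as visible tokens.
        return ["_" if ch == " " else ch for ch in sentence]
    raise ValueError(f"unknown level: {level}")


def build_vocab(tokens):
    """Give each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


sentence = "le chat dort"

word_tokens = tokenize(sentence, "word")
char_tokens = tokenize(sentence, "char")

word_vocab = build_vocab(word_tokens)
char_vocab = build_vocab(char_tokens)

# Each token becomes an index; that index later selects one embedding vector.
word_ids = [word_vocab[t] for t in word_tokens]
char_ids = [char_vocab[t] for t in char_tokens]

print(word_tokens, word_ids)
print(char_tokens, char_ids)
```

Either way, the model only sees a sequence of token ids; the granularity just changes how long the sequence is and how large the vocabulary is.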

If no pretrained embeddings are given, the vector is initialized with random values which are optimized as part of the training (like other model parameters).
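To make this concrete, here is a minimal sketch in plain Python, not OpenNMT's actual code. The vocabulary, the dimension, the initialization range, and the helper names (`lookup`, `one_hot_times_matrix`, `sgd_step`) are all assumptions chosen for illustration. It shows that an embedding lookup gives the same result as multiplying a one-hot vector by the embedding matrix, and that a gradient step updates the looked-up vector like any other parameter.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary and embedding size, for illustration only.
vocab = {"le": 0, "chat": 1, "dort": 2}
dim = 4

# One row per token, initialized with small random values.
embeddings = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]


def lookup(token):
    """Return the embedding row for a token (a simple table lookup)."""
    return embeddings[vocab[token]]


def one_hot_times_matrix(token):
    """Same result, computed as a one-hot vector times the embedding matrix."""
    one_hot = [1.0 if i == vocab[token] else 0.0 for i in range(len(vocab))]
    return [
        sum(one_hot[i] * embeddings[i][j] for i in range(len(vocab)))
        for j in range(dim)
    ]


def sgd_step(token, grad, lr=0.1):
    """Apply one plain SGD update to the row of the given token only."""
    row = embeddings[vocab[token]]
    for j in range(dim):
        row[j] -= lr * grad[j]


before = list(lookup("chat"))
other_before = list(lookup("le"))

# Pretend the loss produced this gradient for the "chat" vector.
sgd_step("chat", [1.0] * dim)

after = lookup("chat")
print(before)
print(after)
```

In this sketch only the row for "chat" changes after the update; rows for tokens that did not take part in the step stay as they were.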


(Negacy Hailu) #5

OK, that kind of makes sense, but in your sentence “If no pretrained embeddings are given, the vector is initialized with random values which are optimized as part of the training”, what is a vector? Is it the representation of each token? So my understanding is that OpenNMT uses word or character embeddings, depending on the tokenization, and that these are learned automatically during training. Is that correct, or am I missing something?


(Guillaume Klein) #6

Yes, it is the learned representation of the token.


(Negacy Hailu) #7

Thanks, that was helpful.