What type of representation is used by OpenNMT?

negacy · October 8, 2018, 7:45pm

How does OpenNMT represent the words (or characters) of an input sequence? Does it use one-hot encoding, or does it compute embeddings internally? For example, given French-to-English parallel sentences, I would like to know how the tool represents the words of the input (the French sentences).

guillaumekln · October 9, 2018, 10:25am

The training optimizes an embedding for each word, as you’d expect.

Does that answer your question?

negacy · October 9, 2018, 12:19pm

Are you saying it learns word embeddings automatically? What if the translation is character-based? Does it learn character embeddings? The tool allows loading pre-trained character embeddings for both the source and target languages, so I was wondering: if no pre-trained character embeddings are provided, how are the characters represented?

guillaumekln · October 9, 2018, 12:25pm

I should have used “for each token” in my sentence. A token can be a word, a subword, or a character depending on the tokenization.
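
To illustrate, here is a minimal Python sketch (not OpenNMT code; the `build_vocab` helper, the example sentence, and the simple whitespace and character splitting are assumptions made for illustration) showing how the same French sentence yields different tokens and ids depending on the tokenization:

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


sentence = "le chat dort"

# Word-level tokenization: split on whitespace.
word_tokens = sentence.split()
# Character-level tokenization: every character, spaces included, is a token.
char_tokens = list(sentence)

word_vocab = build_vocab(word_tokens)
char_vocab = build_vocab(char_tokens)

# Each token is replaced by its id; the id later selects an embedding vector.
word_ids = [word_vocab[t] for t in word_tokens]
char_ids = [char_vocab[t] for t in char_tokens]

print(word_ids)
print(char_ids)
```

In both cases the model only sees a sequence of integer ids; what each id stands for (a word, a subword, or a character) is decided entirely by the tokenization step.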

If no pretrained embeddings are given, the vector is initialized with random values which are optimized as part of the training (like other model parameters).
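
As a rough sketch of what that means (again not OpenNMT's actual implementation; the `init_embeddings` helper, the uniform range of [-0.1, 0.1], the embedding size of 4, and the toy vocabulary are all assumptions for illustration), the embeddings can be seen as a table with one randomly initialized row per token. Looking up a row by token id gives the same result as multiplying a one-hot vector by the table, which is why an explicit one-hot encoding is not needed:

```python
import random


def init_embeddings(vocab_size, dim, seed=0):
    """Create a vocab_size x dim table of small random values (assumed init scheme)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def one_hot(index, size):
    """Return a one-hot vector of the given size with a 1.0 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def one_hot_times_table(vec, table):
    """Multiply a one-hot row vector by the embedding table."""
    dim = len(table[0])
    return [sum(vec[i] * table[i][j] for i in range(len(table))) for j in range(dim)]


# Toy vocabulary; in practice it is built from the tokenized training data.
vocab = {"le": 0, "chat": 1, "dort": 2}
table = init_embeddings(len(vocab), dim=4)

token_id = vocab["chat"]
looked_up = table[token_id]  # direct row lookup
via_one_hot = one_hot_times_table(one_hot(token_id, len(vocab)), table)

# Both routes select the same row of the table.
print(all(abs(a - b) < 1e-12 for a, b in zip(looked_up, via_one_hot)))
```

During training, the rows of this table receive gradient updates like any other weight, so the initially random vectors gradually become useful representations of their tokens.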

negacy · October 9, 2018, 12:34pm

OK, that makes sense. But in your sentence “If no pretrained embeddings are given, the vector is initialized with random values which are optimized as part of the training”, what is “the vector”? Is it the representation of each token? My understanding is that OpenNMT uses token embeddings (word or character, depending on the tokenization) that are learned automatically during training. Is that correct, or am I missing something?

guillaumekln · October 9, 2018, 12:37pm

Yes, it is the learned representation of the token.

negacy · October 9, 2018, 12:39pm

Thanks, that was helpful.