I’m working on and learning about document-level NMT. I recently read about the Doc2vec algorithm for representing documents as embeddings.
My question might be naive, but would it be a good idea, or even a reasonable one, to add a document embedding to a Transformer in order to bring in some context information?
I’m not even thinking of a practical implementation yet; I just want to know whether it is a stupid assumption to think of document-level translation in this way.
I haven’t read the publication yet, but I guess doc2vec might be difficult to add to a translation model, because you need to run an inference step to get a document vector for each new document before translation.
I don’t know OpenNMT well enough to say whether it would be a reasonable idea to implement, or whether the document vector would be too costly to compute.
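For reference, that per-document inference step looks roughly like this with gensim's Doc2Vec implementation (a minimal sketch; the corpus and hyperparameters below are placeholders):

```python
# Minimal doc2vec sketch with gensim; corpus, vector_size and epochs
# are illustrative placeholders, not recommended settings.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Train once on a tokenized corpus of documents.
corpus = [
    TaggedDocument(words=["the", "cat", "sat"], tags=[0]),
    TaggedDocument(words=["dogs", "chase", "cats"], tags=[1]),
]
model = Doc2Vec(corpus, vector_size=64, min_count=1, epochs=20)

# At translation time, every *new* document needs a gradient-based
# inference pass to obtain its vector -- this is the extra cost.
new_doc = ["a", "new", "document", "to", "translate"]
doc_vector = model.infer_vector(new_doc, epochs=20)
print(doc_vector.shape)  # (64,)
```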
Also, what do you think of alternatives like ELMo for getting word embeddings with sentence context?
doc2vec provides an embedding for an entire document (or a batch of sentences), thereby capturing document-level context information.
You can always compute this document representation as a first step, before starting the translation of your document, with either a Transformer or an RNN encoder-decoder system, introducing modifications so the model takes this document vector representation into account.
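As a rough illustration of what such a modification could look like (purely a sketch under my own assumptions, not how OpenNMT does it): project the document vector to the model dimension and add it to every source token embedding before the encoder. Prepending it as an extra pseudo-token would be another variant. Names and dimensions below are hypothetical.

```python
# Hypothetical sketch: conditioning a Transformer encoder on a
# precomputed document vector. Dimensions are illustrative.
import torch
import torch.nn as nn

class DocConditionedEncoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, doc_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.doc_proj = nn.Linear(doc_dim, d_model)  # map doc2vec dim -> d_model
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, src_tokens, doc_vector):
        # src_tokens: (batch, src_len); doc_vector: (batch, doc_dim)
        x = self.embed(src_tokens)                      # (batch, len, d_model)
        x = x + self.doc_proj(doc_vector).unsqueeze(1)  # broadcast over tokens
        return self.encoder(x)
```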
Also, you can play around with the width of the context you use, taking either the entire document or a batch of sentences as context, keeping in mind that the topic can vary within a single document.
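To make that context-width knob concrete, here is a small hypothetical helper, assuming `model` is an already-trained gensim Doc2Vec model like the one in the sketch above:

```python
# Sketch: one vector for the whole document versus one vector per
# fixed-size batch of sentences. `model` is a trained gensim Doc2Vec.
def doc_vectors(model, sentences, batch_size=None):
    """sentences: list of token lists; returns one vector per context window."""
    if batch_size is None:                       # whole-document context
        windows = [sum(sentences, [])]
    else:                                        # batch-of-sentences context
        windows = [sum(sentences[i:i + batch_size], [])
                   for i in range(0, len(sentences), batch_size)]
    return [model.infer_vector(w) for w in windows]
```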
As far as I know, ELMo embeddings only capture sentence context; that is, they ignore inter-sentence information. Recall that NMT systems already handle intra-sentence context through how they build the source sentence representations before passing this information to the decoder. However, I am not aware of any approach that has used ELMo embeddings to check whether changing these word representations helps systems better manage sentence-level context information.
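For what it's worth, getting ELMo embeddings sentence by sentence looks roughly like this with AllenNLP's `ElmoEmbedder` (a sketch; this class shipped with allennlp 0.x and downloads pretrained weights on first use):

```python
# Sketch: ELMo contextual word embeddings, one sentence at a time --
# the model only ever sees intra-sentence context.
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # default pretrained options/weights
vectors = elmo.embed_sentence(["The", "bank", "was", "closed"])
print(vectors.shape)  # (3 layers, 4 tokens, 1024 dims)
```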
I am quite interested in document-level MT, so feel free to write me a PM any time to keep discussing this topic.