Thread on the project Gitter, summarized here:
we’re currently experimenting with adding more chat context, and different kinds of context, to improve results… seq2seq is said to favor the tail end of the input, which is helpful here. it seems to be working, though we don’t have a good measure yet beyond the proportion of response suggestions accepted without edits.
using the history (chat context) is a cool idea… I am wondering if we could optimize by having an independent encoder for the history and keeping only the context vector of its last word, which would mean a virtually unlimited history size.
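A minimal sketch of that message's idea (all names, shapes, and the simple-RNN choice are my assumptions, not the project's code): run a separate encoder over the history and keep only the final hidden state, so memory stays constant no matter how long the history gets.

```python
import numpy as np

def encode_history(embeddings, W_xh, W_hh, b_h):
    """Encode a history of token embeddings (shape (T, d_in)) with a plain
    tanh RNN and return only the last hidden state: a fixed-size vector
    regardless of T, computed in one O(T) pass with O(1) memory in T."""
    h = np.zeros(W_hh.shape[0])
    for x in embeddings:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
    return h  # "context vector of the last word" of the history

# Illustrative dimensions and random weights (assumptions for the demo).
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
W_xh = rng.standard_normal((d_h, d_in)) * 0.1
W_hh = rng.standard_normal((d_h, d_h)) * 0.1
b_h = np.zeros(d_h)

# A 500-token history still yields the same fixed-size vector as a 5-token one.
hist_vec = encode_history(rng.standard_normal((500, d_in)), W_xh, W_hh, b_h)
short_vec = encode_history(rng.standard_normal((5, d_in)), W_xh, W_hh, b_h)
```

The fixed-size output is what makes the "virtually unlimited history" point work: the main seq2seq model only ever sees one extra vector.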
another thing people do is a hierarchical encoder, HRED (https://arxiv.org/abs/1507.02221)
although that might not save memory
I have been thinking of using previous-sentence history for the regular translation task. The limitation is that most of our training data does not have this “history” (the way most translation memories are built, they do not preserve sentence order or even full documents), but it could be very interesting. The document encoder would pass over the full document (not only the previous sentences), and we would just use the “document vector” as an additional input for decoding.
I would be interested in testing that idea
just my two cents - this idea could be used to translate whole documents instead of just sentence by sentence
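The document-vector idea above might look like this (a hedged sketch: the pooling choice, names, and shapes are illustrative assumptions, since the thread leaves the document encoder open): pool sentence encodings into one vector and concatenate it to the decoder input at every step.

```python
import numpy as np

def document_vector(sentence_encodings):
    """Collapse per-sentence encodings (shape (num_sentences, d_doc)) into
    one fixed-size document vector. Mean pooling is just one option."""
    return sentence_encodings.mean(axis=0)

def decoder_step(y_emb, h_prev, doc_vec, W_in, W_hh, b):
    """One decoder step where the document vector is appended to the usual
    target-word embedding as an additional input."""
    x = np.concatenate([y_emb, doc_vec])
    return np.tanh(W_in @ x + W_hh @ h_prev + b)

# Illustrative dimensions and random weights (assumptions for the demo).
rng = np.random.default_rng(1)
d_emb, d_doc, d_h = 8, 16, 32
doc_vec = document_vector(rng.standard_normal((20, d_doc)))  # 20 sentences
W_in = rng.standard_normal((d_h, d_emb + d_doc)) * 0.1
W_hh = rng.standard_normal((d_h, d_h)) * 0.1
b = np.zeros(d_h)
h = decoder_step(rng.standard_normal(d_emb), np.zeros(d_h), doc_vec, W_in, W_hh, b)
```

Because the document vector is fixed once per document, it adds only one concatenation per decoding step on top of the usual sentence-level model.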
wrt history, if the datasets are not shuffled and there is consistency between sentences, could we have a fixed-length sequence (e.g. 100 tokens or more) that would “slide” over the dataset, treated as a stream instead of as eos-separated sentences?
That’s what people do for language modeling. It would be hard to incorporate the attention mechanism in there, but not impossible
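The sliding-window scheme discussed above can be sketched as follows (window size and stride are illustrative assumptions): treat the corpus as one token stream and emit fixed-length windows that ignore sentence/eos boundaries, each window becoming one training example.

```python
def sliding_windows(token_stream, size=100, stride=1):
    """Yield fixed-length windows sliding over a token stream.
    Sentence boundaries are ignored; the stream is consumed lazily,
    so it can be arbitrarily long."""
    buf = []
    for tok in token_stream:
        buf.append(tok)
        if len(buf) == size:
            yield list(buf)
            buf = buf[stride:]  # slide forward by `stride` tokens

# Tiny demo: a 10-token stream with windows of length 4.
windows = list(sliding_windows(range(10), size=4))
```

With `stride=1` consecutive examples overlap heavily, which is the usual setup for language modeling; attention over such a window attends within the fixed window rather than within a sentence.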