In my experience, using attention argmax() for <UNK> replacement often fails with deeper networks and copies the wrong token from the source sequence. Does anyone know of some simple alternative techniques? Perhaps something based on the word embeddings of the words surrounding the <UNK>, or anything like that? Or maybe some heuristic to modify the blind argmax() token copying?
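For anyone following along, the blind argmax() copying I mean can be sketched roughly as below. This is only an illustration, not the toolkit's actual code: the function name `replace_unk_argmax`, the `<unk>` literal, and the layout of `attention` (one row of source weights per target token) are all assumptions made for the example.

```python
def replace_unk_argmax(src_tokens, tgt_tokens, attention, unk="<unk>"):
    """Replace each unknown target token with the most-attended source token.

    Assumption: attention[i][j] is the weight given to source token j
    when target token i was generated.
    """
    out = []
    for i, tok in enumerate(tgt_tokens):
        if tok == unk:
            weights = attention[i]
            # Blind argmax: take whichever source position has the largest weight.
            j = max(range(len(src_tokens)), key=lambda k: weights[k])
            out.append(src_tokens[j])
        else:
            out.append(tok)
    return out


# Toy data (made up): the last target token is unknown.
src = ["ich", "wohne", "in", "Wuppertal"]
tgt = ["i", "live", "in", "<unk>"]
attn = [
    [0.90, 0.05, 0.03, 0.02],
    [0.05, 0.90, 0.03, 0.02],
    [0.02, 0.03, 0.90, 0.05],
    [0.02, 0.03, 0.46, 0.49],  # two nearly tied candidates
]
print(replace_unk_argmax(src, tgt, attn))
```

With these made-up weights the right word is copied, but swapping 0.46 and 0.49 in the last row is enough to copy "in" instead, which is the kind of miss I keep seeing.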
In my experiment, substitution based on attention works fine.
My model: embedding size 500, hidden size 1000, single RNN layer.
It worked OK for size 500 (both RNN and embeddings) with 2 layers, but once I upgraded to 4 layers of size 1000 (again both RNN and embeddings), the substitution started to miss a lot, usually substituting a word right next to the correct one (although the BLEU score improved).
I copy @arebollo-systran and @jungikim, who ran tons of experiments on that: as it is today, the <UNK> substitution based on attention softmax is very random. For the same network and the same data size, it can work well or … not. And when it does not, it is quite frustrating because it generally picks some word close to the unknown one. When looking at the probability distribution, the values are very close, so it is just “bad luck”.
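One way to make this "bad luck" visible is to look at the gap between the two largest attention weights for an unknown target token. The sketch below is only an illustration; `attention_margin` is a hypothetical helper, not part of any toolkit, and the weights are made up.

```python
def attention_margin(weights):
    """Return (best_index, margin) for one non-empty row of attention weights.

    margin is the top weight minus the runner-up; a small margin means the
    argmax choice is close to a coin flip between two source positions.
    """
    order = sorted(range(len(weights)), key=lambda k: weights[k], reverse=True)
    best = order[0]
    runner_up = weights[order[1]] if len(order) > 1 else 0.0
    return best, weights[best] - runner_up


# Made-up attention row for one unknown target token.
best, margin = attention_margin([0.02, 0.03, 0.46, 0.49])
print(best, round(margin, 2))
```

A margin this small says the copied word and its neighbour are almost equally likely, so logging it next to each substitution can help separate near-ties from cases where the attention is confidently wrong.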
We have been trying several things to improve this, like guided alignments or trying to discourage the use of <UNK> in the target, but so far there is no generic, perfect solution.
A lot of people (including us) are working around this by using BPE, which has other types of problems.
We are exploring tracks on coverage modelling; any idea is welcome!
See "It will be better if there is a Coverage Attention Module" or "How to deal with repetition?".
My 50 cents on this topic: what I have seen with the datasets I have been working on is that attention gives very poor results; success rates can be as low as 3% or less for an attention-based model. The challenge is that with a small set of samples the attention model apparently does not work at all. I think that when substituting UNKs it should focus ONLY on unknown words. If I have 3 unknowns in a long sentence and just a 3% chance of hitting the correct word, a better method would be to put those three words in an array and pick one of them at random, which would give a 33% chance of getting it right.
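A rough sketch of what focusing only on unknown words could look like: keep the attention weights, but only compare them across source positions whose token is out of vocabulary. Everything here is an assumption for illustration (the name `replace_unk_restricted`, passing the source vocabulary as a plain set, the `<unk>` literal, the toy weights); it is not how any existing toolkit does it.

```python
def replace_unk_restricted(src_tokens, tgt_tokens, attention, src_vocab, unk="<unk>"):
    """Replace unknown target tokens using only out-of-vocabulary source tokens.

    Assumption: attention[i][j] is the weight of source token j for target
    token i, and src_vocab is the set of known source words.
    """
    oov_positions = [j for j, tok in enumerate(src_tokens) if tok not in src_vocab]
    # If the source has no unknown word at all, fall back to every position.
    candidates = oov_positions or list(range(len(src_tokens)))
    out = []
    for i, tok in enumerate(tgt_tokens):
        if tok == unk:
            j = max(candidates, key=lambda k: attention[i][k])
            out.append(src_tokens[j])
        else:
            out.append(tok)
    return out


# Toy data (made up): plain argmax over the last row would pick "in".
src = ["ich", "wohne", "in", "Wuppertal"]
tgt = ["i", "live", "in", "<unk>"]
attn = [
    [0.90, 0.05, 0.03, 0.02],
    [0.05, 0.90, 0.03, 0.02],
    [0.02, 0.03, 0.90, 0.05],
    [0.02, 0.03, 0.49, 0.46],
]
vocab = {"ich", "wohne", "in"}
print(replace_unk_restricted(src, tgt, attn, vocab))
```

With several unknown source words the attention weights still decide among them, so this should do no worse than the uniform random pick (1 in 3 for three unknowns) as long as the attention carries some signal.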
Of course, focusing on unknown words can bring good results in many cases. But, for several reasons, it can also fail:
- since you are working with limited vocabs, depending on the way you built them, a word can be in the dict of one language and not in the other.
- you can have an unknown word in the input sentence, and ONMT will construct a translation with only known words, or the other way around.
- an unknown rare word in one language can have a translation built from very common words in the other.
- there is often an N-word to M-word match to find.