I'm really interested in knowing your results. I will certainly run my own experiments on this subject in the near future.
My idea is this: an RNN doesn't like repetitions, and on the decoder side, with only 0/1 codes, you give it very poor information in the RNN loop to decide on a given value. By adding the words to the output, you provide it with a richer context, without repetitions: the repeated 0/1 values should then be simple component values in a richer space.
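To make the idea concrete, here is a minimal sketch of how such an enriched decoder target could be built. The interleaved layout (each word followed by its tag) and the name `build_target` are only my assumptions for illustration; other layouts would work as well.

```python
def build_target(words, tags):
    """Interleave each input word with its 0/1 tag (hypothetical target format)."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    target = []
    for word, tag in zip(words, tags):
        target.append(word)      # the word gives the decoder a richer context
        target.append(str(tag))  # the 0/1 code it has to predict
    return target


print(build_target(["the", "cat", "sat"], [0, 1, 0]))
# ['the', '0', 'cat', '1', 'sat', '0']
```

With this layout the decoder never sees a long run of identical 0/1 symbols, since a word always sits between two codes.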
Even if the words are not well predicted, this could be sufficient to produce good 0/1 values. If the words are well predicted, even with the wrong number of tokens, you will be able to build a relevant post-analysis to decide which input words to tag with the 0/1 output.
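One possible form of that post-analysis, as a rough sketch: align the predicted words with the input words and copy each tag over where the words match. I use `difflib.SequenceMatcher` from the Python standard library here; the function name `transfer_tags` and the choice of a default tag of 0 for unmatched input words are assumptions of mine, not something I have tested on real model output.

```python
from difflib import SequenceMatcher


def transfer_tags(input_words, predicted_words, predicted_tags, default=0):
    """Copy predicted 0/1 tags onto input words via sequence alignment.

    Input words with no matching predicted word keep `default`
    (an assumption; another fallback may suit your task better).
    """
    tags = [default] * len(input_words)
    matcher = SequenceMatcher(a=input_words, b=predicted_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            tags[block.a + k] = predicted_tags[block.b + k]
    return tags


# The decoder dropped "sat", so it produced 3 tokens instead of 4.
print(transfer_tags(["the", "cat", "sat", "down"],
                    ["the", "cat", "down"],
                    [0, 1, 1]))
# [0, 1, 0, 1]
```

This only handles exact word matches; if the decoder misspells or substitutes words, a fuzzier alignment would be needed.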