Beam search with target word features: how does it work?

Hi all,

I’m wondering how translation works when we have word features on the target side (for example, an upper case/lower case feature). I was confused by this piece of documentation:
“During decoding, the beam search is only applied on the target words space and not on the word features. When the beam path is complete, the associated features are selected along this path.”

How could we predict the most likely complete path in beam search, without knowing the word features of the target sequence that was generated so far?

So, as an example, if I’m translating from English to Dutch:
The city of Paris is a great holiday destination.
into (ideally):
De stad Parijs is een geweldige vakantie lokatie.

Then it matters whether I generated
De stad Parijs … or
De stad parijs … so far. So to me it looks like I need the word features generated so far, both to score this partial path and to score the probability of the next word on the path.

Would be great if someone could clarify this :).
Cheers, Karlijn


The features are generated at each step and fed as inputs to the next step, but their probabilities are not taken into account when computing the beam path. Does that make sense?
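
To make this concrete, here is a minimal sketch of the idea in Python. It is not the actual decoder code: `toy_step`, `WORD_PROBS`, `FEAT_PROBS` and `beam_search` are hypothetical names, and the probability tables are made up for illustration. The point it tries to show is that each hypothesis carries its features along and passes them to the next step, while only the word log-probabilities are added to the beam score.

```python
import math

BOS, EOS = "<s>", "</s>"

# Hypothetical toy distributions, keyed by the last generated word.
WORD_PROBS = {
    BOS: {"de": 0.9, "het": 0.1},
    "de": {"stad": 0.8, EOS: 0.2},
    "het": {EOS: 1.0},
    "stad": {"parijs": 0.7, EOS: 0.3},
    "parijs": {EOS: 1.0},
}
FEAT_PROBS = {
    BOS: {"none": 1.0},
    "de": {"upper": 0.6, "lower": 0.4},
    "het": {"upper": 0.6, "lower": 0.4},
    "stad": {"lower": 0.9, "upper": 0.1},
    "parijs": {"upper": 0.95, "lower": 0.05},
}


def toy_step(words, feats):
    """Stand-in for one decoder step.

    Returns P(next word) and P(feature of the last word). A real decoder
    would condition on all of `words` and `feats`; this toy only looks at
    the last word.
    """
    last = words[-1]
    return WORD_PROBS[last], FEAT_PROBS[last]


def beam_search(step_fn, beam_size=2, max_len=10):
    # Each hypothesis is (word-only score, words so far, features so far).
    beams = [(0.0, [BOS], [])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words, feats in beams:
            word_probs, feat_probs = step_fn(words, feats)
            # Feature of the previous word: plain argmax. Its probability
            # is NOT added to the hypothesis score.
            feat = max(feat_probs, key=feat_probs.get)
            for word, p in word_probs.items():
                # Only the word log-probability contributes to the score.
                candidates.append(
                    (score + math.log(p), words + [word], feats + [feat])
                )
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:
            (finished if cand[1][-1] == EOS else beams).append(cand)
        if not beams:
            break
    # Fall back to unfinished hypotheses if max_len was reached first.
    return max(finished or beams, key=lambda c: c[0])


best_score, best_words, best_feats = beam_search(toy_step)
print(best_words)
print(best_feats)
```

In this sketch `best_feats[i]` is the feature attached to `best_words[i]`, so the feature list is always one item shorter than the word list: the feature of the newest word is only predicted at the following step.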


Thanks for the quick reply!
So yes, I think it makes sense to me now. To recap with the example:

If we generated so far
w0, w1, w2 = de stad parijs
and the corresponding target word case-features
f0, f1 = upper, lower
(my understanding is we didn’t predict the target word feature ‘upper case’ for Parijs here yet).

Then in the next timestep we’ll use as input (w0, w1, w2, f0, f1), and we’ll output
w3 and f2 (and we’ll choose w3 irrespective of whether it turns out to get an upper or lower case feature f3 attached to it in the next timestep).

That’s correct.