I’m wondering how translation works when there are word features on the target side (for example, an upper-case/lower-case feature). I was confused by this piece of documentation:
"During decoding, the beam search is only applied on the target words space and not on the word features. When the beam path is complete, the associated features are selected along this path."
How could we predict the most likely complete path in beam search, without knowing the word features of the target sequence that was generated so far?
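To spell out what I mean, here are the two scoring schemes in my own notation (x is the source sentence, w_i the i-th target word, f_i its word feature). The factorization in the second line is only my assumption about what a feature-aware score would look like; it is not taken from the documentation.

```latex
% Words-only score of a partial path, as I read the quoted passage:
\mathrm{score}(w_{1:t}) = \sum_{i=1}^{t} \log P(w_i \mid w_{<i}, x)

% Feature-aware score I would have expected (assumed factorization):
\mathrm{score}(w_{1:t}, f_{1:t}) = \sum_{i=1}^{t} \Big[ \log P(w_i \mid w_{<i}, f_{<i}, x) + \log P(f_i \mid w_{\le i}, f_{<i}, x) \Big]
```

In the first form the features f_{<i} never enter the score, which is exactly the part I don't understand.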
Here is an example, translating from English to Dutch:
The city of Paris is a great holiday destination.
The stad Parijs is een geweldige vakantie lokatie.
Then it matters whether I generated
The stad Parijs … or
The stad parijs … so far. So it seems to me that I need the word features generated so far, both to score this partial path and to score the probability of the next word on it.
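To make the question concrete, here is a minimal Python sketch of how I read the quoted passage, using a toy version of the example above. Everything in it is hypothetical: the probability tables, the toy vocabulary, and the names `next_word_logprobs`, `case_feature_probs`, `beam_search`, `attach_features` and `render` are made up for illustration and are not the toolkit's actual code. The beam is scored on words only, and the case feature is picked per word after the best path is complete.

```python
import math

# Hypothetical toy "decoder": next-word probabilities indexed by prefix length.
WORD_PROBS = [
    {"de": 0.9, "stad": 0.1},
    {"stad": 0.8, "parijs": 0.2},
    {"parijs": 0.7, "</s>": 0.3},
    {"</s>": 1.0},
]

# Hypothetical case-feature probabilities per word: "C" = capitalized, "l" = lowercase.
CASE_PROBS = {
    "de": {"C": 0.9, "l": 0.1},
    "stad": {"C": 0.1, "l": 0.9},
    "parijs": {"C": 0.95, "l": 0.05},
    "</s>": {"l": 1.0},
}


def next_word_logprobs(prefix):
    """Log-probabilities of the next word given the word prefix only."""
    step = min(len(prefix), len(WORD_PROBS) - 1)
    return {w: math.log(p) for w, p in WORD_PROBS[step].items()}


def case_feature_probs(prefix, word):
    """Feature distribution for `word`; this toy version ignores the prefix."""
    return CASE_PROBS[word]


def beam_search(beam_size=2, max_len=6):
    """Beam search over words only; features play no part in the score."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            for w, lp in next_word_logprobs(words).items():
                candidates.append((words + (w,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:beam_size]:
            if words[-1] == "</s>":
                finished.append((words, score))
            else:
                beams.append((words, score))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])


def attach_features(words):
    """After the path is complete, pick the most likely feature for each word."""
    features = []
    for i, w in enumerate(words):
        probs = case_feature_probs(words[:i], w)
        features.append(max(probs, key=probs.get))
    return features


def render(words, features):
    """Apply the case feature to each word and drop the end-of-sentence token."""
    out = []
    for w, f in zip(words, features):
        if w == "</s>":
            continue
        out.append(w.capitalize() if f == "C" else w)
    return " ".join(out)


best_words, best_score = beam_search()
best_features = attach_features(best_words)
print(render(best_words, best_features))  # De stad Parijs
```

In this sketch the choice between "Parijs" and "parijs" never influences which words come next, because `next_word_logprobs` only sees words. That is the behaviour I would like to have confirmed or corrected.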
Would be great if someone could clarify this :).