Sorry, I am still not getting something. If the final sentence is just the sequence of the highest-probability word at each timestep, wouldn't that give the same result as a beam search with a beam size of 1?
To take a concrete example, suppose I predict a sentence with 3 words. My beam search table should then look something like this:
At timestep 1: A (0.9), B (0.8), C (0.7)
At timestep 2: D (0.9), E (0.8), F (0.7)
At timestep 3: H (0.9), I (0.8), J (0.7)
So wouldn't I just always predict A, D, H? Once I have this table built, how do I then determine that some other sentence (such as B, E, I) was actually a better translation than A, D, H?
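
To make the confusion concrete, here is a minimal Python sketch of the table as I currently picture it. It assumes each timestep's scores are fixed and do not depend on which words were picked earlier, and that a sentence is scored by the product of its word scores; both are my assumptions, and names such as `table` and `sequence_score` are made up for illustration. Under those assumptions, picking the top word at each step and exhaustively scoring all 27 sentences give the same answer:

```python
from itertools import product
from math import prod

# Hypothetical table from the example above: one list of (word, score) per timestep.
# Assumption: the scores at a timestep do not depend on earlier word choices.
table = [
    [("A", 0.9), ("B", 0.8), ("C", 0.7)],
    [("D", 0.9), ("E", 0.8), ("F", 0.7)],
    [("H", 0.9), ("I", 0.8), ("J", 0.7)],
]

# Beam size 1 (greedy): keep only the top-scoring word at each timestep.
greedy = [max(step, key=lambda ws: ws[1])[0] for step in table]


def sequence_score(seq):
    """Assumed sentence score: product of the per-word scores."""
    return prod(score for _, score in seq)


# Exhaustive search: score every possible 3-word sentence and keep the best.
best = max(product(*table), key=sequence_score)
best_words = [word for word, _ in best]

print(greedy)      # ['A', 'D', 'H']
print(best_words)  # ['A', 'D', 'H']
```

With a table like this, B, E, I can never beat A, D, H, which is exactly why I don't see what a larger beam buys me.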