Use of beam search in NMT

Hi, sorry if this is the wrong location for this, but I am using the CTranslate standalone decoder for my work. I have consistently seen mentions of ‘beam search’ being used in NMT, but I’m not sure what for.

If someone with expertise could please clarify the following for me, I would really appreciate it! I am only doing the decoding part here.

  • First we take the input and tokenize it. Then we look up the word embedding for each token in the embedding table. Then we pass it to the encoder RNN (where the RNN width == size of the word vector).
  • Then once the EOS tag is reached, the decoder RNN starts outputting a vector at every time step. We then look up this vector in the word vector table (I assume this isn’t an ‘exact’ search and we find the nearest word to the produced vector). Then we detokenize this to generate the actual word.

I’m not sure where the beam search comes into this.

Maybe slide 29 of my talk on this would help? This shows how the beam search process is used.

https://harvardnlp.github.io/seq2seq-talk/slidescmu.pdf

Basically instead of outputting the highest scoring word, you keep around the K highest probability target outputs generated so far.
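
Roughly, one step of that could look like the following (a simplified sketch, not the actual OpenNMT/CTranslate code; `log_probs_fn` is a stand-in for the decoder's prediction):

```python
def expand_beam(hypotheses, log_probs_fn, beam_size):
    """One beam search step: extend every partial hypothesis by one word
    and keep only the beam_size best extended hypotheses.

    hypotheses: list of (word_ids, cumulative_log_prob) pairs.
    log_probs_fn(word_ids): assumed to return one log probability per
    word in the target vocabulary.
    """
    candidates = []
    for word_ids, score in hypotheses:
        log_probs = log_probs_fn(word_ids)
        for word_id, log_prob in enumerate(log_probs):
            # Score of an extended hypothesis = cumulative log probability.
            candidates.append((word_ids + [word_id], score + log_prob))
    # Keep only the K highest scoring candidates.
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]
```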

Hi! Thank you for your quick reply! This forum is absolutely the best I’ve seen anywhere for this :slight_smile:

I get it a bit more, but I’m still not clear about one detail. So at every time step, the RNN generates an N-dimensional output. So we search the target embedding space for the word that best matches this? Then, based on the size of the beam search, we store the top-M of these?

So to generate a translation, how is each word in the target language generated? Is it done one word at a time, or do we wait for the decoder RNN to finish and then do a backwards traversal of the beam search tree to find the most probable output sentence?

And I guess a simple follow-on: if I just use a beam search size of 1, would I then be able to just generate the most likely word at every time step of the decoder RNN?

Thank you again!

The words are generated one by one. At each step, the model produces the likelihood of each word in the target vocabulary.

With a beam of size 1, you just pick the most likely word to generate next.

With a beam of size K, the model produces, for each of the K partial hypotheses, the likelihood of each word in the target vocabulary. You then keep the K most likely extended hypotheses, considering the predictions of every partial hypothesis.

When the top of the beam generates </s> (the end of sentence token), you select the beam with the best score (i.e. the highest cumulative likelihood) and you do the final backward traversal of the beam search to retrieve the full hypothesis.
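
To put the pieces together, here is a very simplified sketch (not the actual CTranslate implementation; `decoder_step`, `bos_id` and `eos_id` are placeholders):

```python
def beam_search(decoder_step, bos_id, eos_id, beam_size, max_steps):
    """Simplified beam search returning the best full hypothesis.

    decoder_step(prefix) is assumed to return one log probability per
    word in the target vocabulary, given the words generated so far.
    """
    # Each hypothesis is (word_ids, cumulative_log_prob).
    beams = [([bos_id], 0.0)]

    for _ in range(max_steps):
        candidates = []
        for word_ids, score in beams:
            log_probs = decoder_step(word_ids)
            for word_id, log_prob in enumerate(log_probs):
                candidates.append((word_ids + [word_id], score + log_prob))
        # Keep the K most likely partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]

        # Stop when the top of the beam generates </s>.
        if beams[0][0][-1] == eos_id:
            break

    # Storing full prefixes makes the final backward traversal implicit;
    # with explicit back-pointers you would walk back from the best end state.
    return beams[0][0]
```

With beam_size=1 this reduces to picking the most likely word at every step, i.e. greedy decoding.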


Hi @guillaumekln, thank you! If I could get a little more clarification from you, that would be very helpful :slight_smile:

For a beam search of N>1, I would have to store the partial hypotheses of all the previous words in the sentence, correct? But once I have hit the </s> token, how would I know which was the most likely path through the search space?

So is the probability of every word following every other possible word stored somewhere in the search space? Or is there some other way of generating this ‘most likely’ sentence from the list of all possible hypotheses?

Thank you!

The score of a hypothesis accumulates the probability of each word included in this hypothesis (in practice, by summing log probabilities). So basically you update a score table at each timestep with the predicted probabilities.
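
For example, something along these lines (a toy sketch; log probabilities are summed, which is what is usually done in practice to avoid numerical underflow):

```python
def update_score_table(beam_scores, step_log_probs):
    """beam_scores[k]: cumulative log probability of partial hypothesis k.
    step_log_probs[k][w]: log probability of word w given hypothesis k.
    Returns the score of every (hypothesis, next word) candidate.
    """
    return [
        [beam_scores[k] + step_log_probs[k][w]
         for w in range(len(step_log_probs[k]))]
        for k in range(len(beam_scores))
    ]
```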

Sorry, I am still not getting something. If the final sentence is just the sequence of words with the highest probability at each timestep, would that not just give the same result as a beam search of size 1?

So if I take a concrete example and predict a sentence with 3 words, my beam search table should look something like this.

Word (probability)
At timestep 1: A (0.9), B (0.8), C (0.7)
At timestep 2: D (0.9), E (0.8), F (0.7)
At timestep 3: H (0.9), I (0.8), J (0.7)

So wouldn’t I just always predict A, D, H? And once I have this table built, how do I then determine that some other sentence (such as B, E, I) was actually a better translation than A, D, H?

Thank you!

It’s like finding the shortest path in a graph. You can follow an edge that costs you more initially but it eventually leads to the shortest path.
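
The catch is that the probabilities at each timestep are conditioned on the words already chosen, so picking B at timestep 1 changes what is likely at timestep 2. Here is a small made-up example (the numbers are invented, purely for illustration):

```python
# p(first word) and p(second word | first word), invented numbers.
p_first = {"A": 0.9, "B": 0.8}
p_second = {
    "A": {"D": 0.3, "E": 0.2},  # after A, no continuation scores well
    "B": {"D": 0.1, "E": 0.9},  # after B, E is a very good continuation
}

best = max(
    ((w1, w2, p_first[w1] * p_second[w1][w2])
     for w1 in p_first for w2 in p_second[w1]),
    key=lambda x: x[2],
)
print(best)  # ('B', 'E', 0.72...) -- B,E wins
```

Greedy decoding (a beam of size 1) commits to A and ends up with A,D at 0.27, while a beam of size 2 keeps B around and finds B,E at 0.72.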

The slides shared by @srush have a detailed description of the beam search. Maybe they are less ambiguous than words.