OpenNMT Forum

English Chatbot model with OpenNMT

Following this thread initiated by @orph and @higgs - opening a new “tutorial” thread for development of an English conversational model based on the paper A Neural Conversational Model, Vinyals, O., & Le, Q. (2015).

Corpus Building

Using the OpenSubtitles corpus (Jörg Tiedemann, 2009), English XML dump (19 GB, 323,905 movies, 338M sentences) - consecutive sentence pairs with the following properties were extracted:

  • the first sentence ends with a question mark
  • the second sentence has no question mark
  • the second sentence follows the first in the movie by less than 20 seconds
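The three extraction rules above can be sketched as a simple filter over consecutive subtitle lines. This is an illustrative reconstruction, not the actual extraction script; `extract_qa_pairs` and the `(text, start_seconds)` input shape are hypothetical, and the OpenSubtitles XML parsing step is omitted.

```python
def extract_qa_pairs(lines, max_gap=20.0):
    """Keep (question, answer) pairs of consecutive subtitle lines where the
    first ends with '?', the second contains no '?', and the second starts
    less than max_gap seconds after the first. `lines` is a list of
    (text, start_seconds) tuples in movie order."""
    pairs = []
    for (q, t_q), (a, t_a) in zip(lines, lines[1:]):
        if q.endswith("?") and "?" not in a and (t_a - t_q) < max_gap:
            pairs.append((q, a))
    return pairs
```

For example, a pair whose answer starts 30 seconds after the question would be dropped by the time constraint.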

The extracted pairs were additionally cleaned up (detokenized, removal of sound sequences like (BANG), removal of overly complex sentences, normalization of quotes), deduplicated (uniq-ed) and shuffled.
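A minimal sketch of that clean-up pass, assuming the pairs are already extracted; this approximates the steps listed above (sound-cue removal, quote normalization, dedup, shuffle) and is not the exact script used to build the released corpus:

```python
import random
import re

def clean_pairs(pairs, seed=1234):
    """Drop bracketed sound cues like (BANG), normalize curly quotes,
    collapse whitespace, deduplicate, and shuffle deterministically."""
    cleaned = set()
    for q, a in pairs:
        q, a = (re.sub(r"\([A-Z ]+\)", "", s) for s in (q, a))
        q, a = (s.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
                for s in (q, a))
        q, a = (re.sub(r"\s+", " ", s).strip() for s in (q, a))
        if q and a:
            cleaned.add((q, a))
    out = sorted(cleaned)          # deterministic order before shuffling
    random.Random(seed).shuffle(out)
    return out
```

Deduplication happens naturally via the set; the fixed seed just makes the shuffle reproducible.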

This gives a 14M sentence-pair corpus, plus 10,000 sentence pairs for validation, available here. The difference from the OpenSubtitles corpus mentioned in the paper is that these pairs are questions/answers, while their corpus contains any sequence of sentences. Note that there is no certainty that the answers actually answer the questions (an answer can come from the same speaker, or the topic and/or scene in the movie may change).

Corpus sample:

What are you doing here, Cao?   To pay Mr. and Mrs. Mak a visit.
But can we cut to what's important here?        There's a Fey Sommers estate out there with your name on it.
I assume it was granted?        He's been given full access - gold level and beyond.
How did you know of Devavrita's return? The entire Universe knows of it my king
What do you think it means?     l appreciate the show and tell, and I don't want you to take this personally, but I work alone.
Who'd have thought it, Venancio?        Us, keeping the Count of Albrit from going hungry... when not even twenty years ago... he was the lord and master of Polan, Jerusa and everything.
Stay close, understand? Keeper
What's toll?    I don't know, they said it.
Let me guess, Kyle?     No, it's Mark actually.
Did you notice what car they were driving?      It's a nightclub, Hal, not a drive-in.
Didn't you?     Yes, we did.
What?   Small and bent of course, deformed, and he hobbles.
Got an easier one?      Just needed to ask you.
He's a nice boy, isn't he?      Yes, he's nice.
How can you possibly hope to stop me?   Well, like the old man said...
Jealous?        I've got more things on my mind.
What a beautiful young voice, how old are you?  Twenty-two.

Here is the evolution of validation perplexity and some interesting sample outputs from a model trained following the above procedure (5M train, 20482500) after 5 epochs:

opensub_qa_en-5M-30.log:[02/11/17 05:51:37 INFO] Validation perplexity: 31.49
opensub_qa_en-5M-30.log:[02/11/17 20:07:36 INFO] Validation perplexity: 27.95
opensub_qa_en-5M-30.log:[02/12/17 10:26:11 INFO] Validation perplexity: 26.14
opensub_qa_en-5M-30restart.log:[02/15/17 14:50:11 INFO] Validation perplexity: 25.18
opensub_qa_en-5M-30restart.log:[02/16/17 04:52:23 INFO] Validation perplexity: 24.66

Q: When do you sleep?
A: I don’t sleep.

Q: You actually did care for her, didn’t you?
A: Of course I did.

Q: Am I talking to myself?
A: I’m talking to you.

Q: Want to hear what seung-moon is lamenting about?
A: No, I don’t want to hear it.

Q: What do you know about him?
A: I’ve seen him around.

Q: Do you know what he does?
A: He’s a writer.


Q: What are you doing?
A: I’m trying to get you out of here.

Q: Where did you come from?
A: I’ve been looking for you.


Update on the model training, which is now finished; the evolution of validation perplexity is shown in the following graph:

Thoughts after training:

  • “Yes/yeah”, “no” and “I don’t know” are the recurring answers for most of the questions in the test set (500)

  • Short answers are preferred because there are many of them in the training data, and they tend to have a higher (less negative) log-likelihood score than longer ones, which accumulate negative per-token scores. When we activate the “nbest” flag, we can observe longer, relevant answers arriving in 2nd or 3rd position.

  • If one wants the system to be more “conversational”, one may consider biasing the beam search, for example by adding penalties for short answers when selecting among the candidates in the final nbest list
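The short-answer penalty suggested above can be sketched as a post-hoc rescoring of the n-best list. This is an illustration only; `rescore_nbest` and the `alpha`/`min_len`/`short_penalty` values are hypothetical, not tuned settings or part of OpenNMT:

```python
def rescore_nbest(nbest, alpha=0.7, min_len=4, short_penalty=2.0):
    """Re-rank an n-best list of (tokens, cumulative_logprob) candidates:
    normalize the cumulative log-probability by length**alpha so long
    answers are not penalized for accumulating per-token scores, and
    subtract a flat penalty for answers shorter than min_len tokens."""
    def score(item):
        tokens, logprob = item
        s = logprob / (len(tokens) ** alpha)
        if len(tokens) < min_len:
            s -= short_penalty
        return s
    return sorted(nbest, key=score, reverse=True)
```

With these settings, a one-word “No.” at log-prob -1.0 ranks below a six-token answer at -7.0, which matches the observation that good answers often sit in 2nd or 3rd position.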

How do you justify the drop in PPL at epoch 9?
Did you start the decay at this epoch?

I’m still training over the whole dataset from @jean.senellart with rnn_size reduced by 2048, showing a PPL of 179.35 at epoch 6.

At the same time, I added a small piece of code to Beam to diversify the responses by using temperature.
With multinomial sampling, it returns varied responses.
But I still need to make them more reasonable.
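The temperature/multinomial idea mentioned above can be sketched as follows. This is a hypothetical standalone illustration, not the actual Beam patch (which would be in Lua inside OpenNMT's decoder):

```python
import math
import random

def sample_with_temperature(logprobs, temperature=0.8, rng=None):
    """Sample a token index from a list of log-probabilities after
    temperature scaling: temperature < 1 sharpens the distribution
    (closer to greedy), temperature > 1 flattens it (more diverse)."""
    rng = rng or random.Random()
    scaled = [lp / temperature for lp in logprobs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
    total = sum(exps)
    r = rng.random() * total
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if acc >= r:
            return i
    return len(exps) - 1
```

At a very low temperature this collapses to argmax, which is one way to interpolate between the usual beam behavior and fully multinomial sampling.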

I saw some other approaches such as Mutual Information, Reinforcement Learning, and memory-based attention.
Does anyone have good experience with these?

I still think the Google paper hides something; it seems hard to get such results using only a vanilla seq2seq model with beam search.

hello @higgs, a PPL of 179 seems super high - how is it decreasing? @DYCSystran’s results are on only the 5M training set, but the PPL is already far smaller at epoch 6. Did you use anything special for the vocabulary? In our system we use a 32K BPE to avoid any unknowns.
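For anyone unfamiliar with how a BPE vocabulary avoids unknowns: it is learned by repeatedly merging the most frequent adjacent symbol pair, so any word can be decomposed into known subword units. A toy version of the textbook algorithm (not OpenNMT's actual tooling, and `learn_bpe` here is an illustrative name):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict.
    Each word is split into characters plus an end-of-word marker;
    the most frequent adjacent pair is merged at every step."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym, freq in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for sym, freq in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1])
                    i += 2
                else:
                    out.append(sym[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges
```

In practice one would run ~32K merges over the full corpus, as mentioned above, rather than this toy example.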

For the paper - it does not say how much they had to handpick the examples - we should maybe ask them :slight_smile:?

@jean.senellart The first is PPL over train, and the second over valid.

Epoch1: 18596.83 / 452.51
Epoch2: 436.49 / 255.94
Epoch3: 283.61 / 199.12
Epoch4: 227.58 / 171.71
Epoch5: 197.29 / 155.41
Epoch6: 179.24 / 144.13
Epoch7: 166.51 / (on-going)


Good guess! And we should probably have started the decay earlier. Here we spent two extra epochs of training to “figure out” that we should start decaying, which is far too costly.

We can reduce that to just one extra epoch by using a more aggressive decay strategy.

But we need a good early-stopping strategy if we don’t want to waste an entire epoch finding out that it’s time to decay.
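The decay rule being discussed can be sketched as a simple check on the validation-perplexity history. These helper names and the `threshold`/`decay` values are illustrative, not OpenNMT options:

```python
def should_decay(val_ppls, threshold=0.0):
    """Trigger decay once validation perplexity stops improving by more
    than `threshold` between consecutive epochs."""
    return len(val_ppls) >= 2 and val_ppls[-2] - val_ppls[-1] <= threshold

def next_learning_rate(lr, val_ppls, decay=0.5):
    """Halve the learning rate when should_decay fires, else keep it."""
    return lr * decay if should_decay(val_ppls) else lr
```

A non-zero `threshold` would trigger the decay earlier (e.g. when the epoch-to-epoch improvement drops below 0.5 PPL), trading one epoch of patience for a more aggressive schedule.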

Thank you for your suggestions for beam search.

Could you show me your set of training parameters (rnn_size, embedding size, layers, optimization method, etc.)? The perplexities seem really high.

@DYCSystran Nothing special.

Over 14M sentences

-layers 2
-rnn_size 2048
-word_vec_size 1000
-max_batch_size 64
-end_epoch 5000
-optim ‘adagrad’
-learning_rate 0.1
-dropout 0.3

@jean.senellart How are things going? You must be using 2 layers with rnn_size 4096 and a limited vocabulary (<20).

We usually recommend sticking with SGD, which is the default optimization method and consistently produces good results.

Thank you for sharing the corpus, which saves me the dirty work. I think the encoder-decoder-with-attention architecture may not be strong enough to infer from the source information and produce a result. It can do an OK job at translation because the source and target sentences have the same meaning, but many generations from a conversation system are useless because it needs to infer something, which is not just the simple probabilistic problem this model can handle. I’m going to try whether combining it with a QA system helps.

Hi all - if you want to try it, the model is live here:

Please share in this post the best answers you get!

We are working on forcing the bot to answer with more variety - but there are already a good number of interesting answers!

aurel@Bot: What appeared first between the chicken and the egg?
Bot@aurel: I don't know.
aurel@Bot: What appeared first, chicken or egg?
Bot@aurel: Chicken.

It seems to be quite sarcastic …

aurel@Bot: Could you please help me?
Bot@aurel: I did.
John@Bot: What is your job?
Bot@John: I'm a writer.
GA@Bot: how many languages you speak?
Bot@GA: All of them.
John@Bot: How many books did you write?
Bot@John: One.
GA@Bot: parlez vous français?
Bot@GA: Yes.
GA@Bot: comment dit on bonjour en anglais ?
Bot@GA: Oh, my God.
John@Bot: Aren't you a clown?
Bot@John: No.
John@Bot: To be or not to be?
Bot@John: Not yet.

This looks great! All the responses seem contextually relevant. When training, you might want to also filter out all sentences fewer than 5 words long. This might help make replies longer and more interesting.

This is additional information on training a more interesting model from this paper.

Thanks @kyang for the link to the paper - I will look at it. We are working on controlling the decoding to avoid the easy answers “yes/no/I don’t know”. The answers are indeed often very appropriate, and the model is trained on only 5M sentences. We hope for even better results on the complete set.

another good and consistent sequence (not on purpose):

John@Bot: what is your favorite colour?
Bot@John: I have no preference.
John@Bot: do you prefer pink or blue?
Bot@John: green.

A bit of an aggressive bot for a first interaction… :open_mouth:

kathy@Bot: I am hungry.
Bot@kathy: No.
kathy@Bot: still here?
Bot@kathy: Yeah.
kathy@Bot: He wants to join you!
Bot@kathy: I’ll kill you!
