English Chatbot advice


(higgs) #21

The dataset is the Cornell movie corpus (222,616 pairs), as I mentioned before.
I split it into train (183,941), valid (22,162), and test (15,513) sets.
I used 3 layers with an RNN size of 1000 and a word embedding size of 300.
You can see the full set of parameters below.

th preprocess.lua -train_src data/movie/train.src.txt -train_tgt data/movie/train.dst.txt -valid_src data/movie/valid.src.txt -valid_tgt data/movie/valid.dst.txt -save_data results/movie

th train.lua -gpuid 1 -data results/movie-train.t7 -save_model results/movie-model -layers 3 -rnn_size 1000 -word_vec_size 300 -brnn_merge concat

And the initial results at epoch 16 are not good either.

[02/07/17 17:44:50 INFO] Loading ‘cv.movie/movie-model_epoch16_186.03.t7’…
[02/07/17 17:45:18 INFO] SENT 1: hello!
[02/07/17 17:45:18 INFO] PRED 1: you know what i mean.
[02/07/17 17:45:18 INFO] PRED SCORE: -8.7295
[02/07/17 17:45:18 INFO] GOLD 1: hello!
[02/07/17 17:45:18 INFO] GOLD SCORE: -14.9353
[02/07/17 17:45:18 INFO]
[02/07/17 17:45:18 INFO] SENT 2: how are you?
[02/07/17 17:45:18 INFO] PRED 2: i don’t know.
[02/07/17 17:45:18 INFO] PRED SCORE: -6.0676
[02/07/17 17:45:18 INFO] GOLD 2: i’m good.
[02/07/17 17:45:18 INFO] GOLD SCORE: -12.0968
[02/07/17 17:45:18 INFO]
[02/07/17 17:45:18 INFO] SENT 3: what’s your name?
[02/07/17 17:45:18 INFO] PRED 3: i don’t know.
[02/07/17 17:45:18 INFO] PRED SCORE: -6.0925
[02/07/17 17:45:18 INFO] GOLD 3: i’m julia.
[02/07/17 17:45:18 INFO] GOLD SCORE: -20.1079
[02/07/17 17:45:18 INFO]
[02/07/17 17:45:18 INFO] SENT 4: when were you born?
[02/07/17 17:45:18 INFO] PRED 4: don’t worry about it. i don’t know what to say.
[02/07/17 17:45:18 INFO] PRED SCORE: -16.6385
[02/07/17 17:45:18 INFO] GOLD 4: july 20th.
[02/07/17 17:45:18 INFO] GOLD SCORE: -19.1772
[02/07/17 17:45:18 INFO]
[02/07/17 17:45:18 INFO] SENT 5: what year were you born?
[02/07/17 17:45:18 INFO] PRED 5: not yet.
[02/07/17 17:45:18 INFO] PRED SCORE: -5.9572
[02/07/17 17:45:18 INFO] GOLD 5: 1977.
[02/07/17 17:45:18 INFO] GOLD SCORE: -6.7442
[02/07/17 17:45:18 INFO]
[02/07/17 17:45:18 INFO] SENT 6: where are you from?
[02/07/17 17:45:18 INFO] PRED 6: i don’t know.
[02/07/17 17:45:18 INFO] PRED SCORE: -5.9520
[02/07/17 17:45:18 INFO] GOLD 6: i’m out in the boonies.
[02/07/17 17:45:18 INFO] GOLD SCORE: -18.8673
[02/07/17 17:45:18 INFO]
[02/07/17 17:45:18 INFO] SENT 7: are you a man or a woman?
[02/07/17 17:45:18 INFO] PRED 7: <unk>
[02/07/17 17:45:18 INFO] PRED SCORE: -5.2376
[02/07/17 17:45:18 INFO] GOLD 7: i’m a woman.
[02/07/17 17:45:18 INFO] GOLD SCORE: -15.3107

(omitted)

From these results, I don’t think that PPL tells you how good a conversational model is.
I wonder if anyone has successfully reproduced Google’s result.
Does anyone have any ideas?


(jean.senellart) #22

Thanks. I will give it a try too. The fact that none of the answers has anything to do with the source is really weird; it seems the only ability of the system is to generate reasonable-sounding sentences, which is what shows up in the PPL.


(jean.senellart) #23

Google’s experiment uses the whole OpenSubtitles corpus as training data, which is 62M sentence pairs. I will also start on a bigger dataset.


(higgs) #24

@jean.senellart
If you see an empty answer, it is because of the “unk” token with angle brackets.
On this forum, angle brackets are treated as markup, so the token was hidden.
So don’t read any meaning into an empty response.

The real problem is the unreasonable responses.
If a bigger dataset solves this, that will be awesome.
PPL is just a measurement for the language model, so it doesn’t score how reasonable a response is given a request.
BLEU can be used for machine translation, but I’m not sure it applies here.
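To make the PPL point concrete: perplexity is just the exponential of the average per-token negative log-likelihood, so a model can score well by always producing safe, generic replies. A minimal sketch (the token count of 5 for PRED 2 is my rough estimate, not from the logs):

```python
import math

def perplexity(total_logprob, num_tokens):
    """Perplexity from a sentence's total log-probability (natural log)."""
    return math.exp(-total_logprob / num_tokens)

# PRED 2 "i don't know." scored -6.0676; over ~5 tokens this is a
# fairly low perplexity, even though the reply ignores the question.
print(round(perplexity(-6.0676, 5), 2))
```

This is why a generic "i don't know" can beat the gold answer on score: it is high-probability under the language model regardless of the source sentence.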


(jean.senellart) #25

Hi @higgs. (I fixed the <unk> display in your logs)

I built an OpenSubtitles corpus and opened a new tutorial topic for follow-up. We have a huge corpus to experiment with :slight_smile: - please contribute!

I will kick off a training on my side with the following set-up, close to the original one:

  • remove all sentences with <unk> in the target
  • 2 layers, LSTM size 4096, no attention model.
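The <unk> filter in the first bullet can be done with standard tools before preprocessing. A sketch, assuming tab-free, line-aligned source/target files (the file names are placeholders for the OpenSubtitles data):

```shell
# Drop pairs whose *target* side contains <unk>, keeping the two sides aligned.
paste data/opensub/train.src.txt data/opensub/train.dst.txt \
  | awk -F'\t' '$2 !~ /<unk>/' > data/opensub/train.filtered.tsv
cut -f1 data/opensub/train.filtered.tsv > data/opensub/train.filtered.src.txt
cut -f2 data/opensub/train.filtered.tsv > data/opensub/train.filtered.dst.txt
```

Pasting the two files together first is what keeps source and target in sync; filtering each file separately would break the pairing.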

(higgs) #26

@jean.senellart

Using a larger dialog corpus would be a good option, since the Google paper mentions they used 62M sentences.
With your dataset (14M), I am now training again.
But I could not use 2 layers with an LSTM size of 4096 because of out-of-memory errors on a GTX Titan (12 GB memory).
Let’s see the results after several days.


(jean.senellart) #27

On my side, 4096 works by reducing the maximum sentence length (20 for source, 30 for target), which drops less than 2% of the sentences.
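For reference, the length limits above would be applied at preprocessing time, along the lines of the earlier command. A sketch only: the `-src_seq_length`/`-tgt_seq_length` flag names are my assumption from the preprocess.lua options (check your OpenNMT version), and the data paths are placeholders:

```shell
th preprocess.lua -train_src data/opensub/train.src.txt -train_tgt data/opensub/train.dst.txt \
  -valid_src data/opensub/valid.src.txt -valid_tgt data/opensub/valid.dst.txt \
  -src_seq_length 20 -tgt_seq_length 30 -save_data results/opensub
```

Shorter sequences cut the activation memory per batch, which is what makes the 4096-unit LSTM fit on a 12 GB card.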


(Li hangyu) #28

Are there any online demos of well-performing models? I’d like to try some queries. I also think we should publish a set of good test queries, so one can compare the results of one’s model against others’.


(jean.senellart) #29

You can try the model described here at this URL: http://chatbot.mysystran.com.


(Zhong Peixiang) #30

Could you open-source this tutorial as part of OpenNMT? The responses are quite nice, I think, at least in terms of grammar.