I have committed tools for language models: you can now use
lm.lua to score and sample a corpus with language models. Text with features is supported.
To try it:
- train a language model:
th preprocess.lua -data_type monotext -train train.tok -valid valid.tok -save_data datalm
if you use features, activate
- score with a language model:
th lm.lua score -model model.t7 -src t
This gives output.txt with the perplexity of each sentence.
- sample with a language model:
th lm.lua sample -model model.t7 -src t [-temperature T] [-max_seq_len L]
where t contains the prefixes (seeds) of the sentences, which can be empty. Random sentences of up to L words will be generated. The parameter
T, between 0 and 1, governs how random the generation is: the closer to zero, the more consistent and less random.
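For readers who want a feel for the two quantities mentioned above, here is a minimal Python sketch of per-sentence perplexity and of temperature-scaled sampling. It is not taken from lm.lua: the function names are made up for illustration, and it assumes the common convention that temperature divides the logits before the softmax, which may or may not match what lm.lua does internally. T must be strictly positive here.

```python
import math
import random

def sentence_perplexity(token_log_probs):
    """Perplexity of one sentence from per-token natural-log probabilities."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (assumed convention, temperature > 0)."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng=random):
    """Draw one vocabulary index from the temperature-scaled distribution."""
    probs = apply_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for a 3-word vocabulary
    print(apply_temperature(toy_logits, 1.0))
    print(apply_temperature(toy_logits, 0.1))  # mass concentrates on index 0
    print(sentence_perplexity([math.log(0.5)] * 4))
```

With a low T almost all of the probability mass goes to the highest-scoring word, so samples become more consistent; with T close to 1 the model's own distribution is used unchanged.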
Any feedback is welcome. As a teaser, I trained an LM on the corpus from:
and the LM can generate random questions that the chatbot will answer! Just for fun, a sample:
LM@Bot: Where is my love?
Bot@LM: Tell me.
LM@Bot: Has Ken come back yet?
Bot@LM: no, not yet
LM@Bot: WHATS HITCHKOK EXACTLY?
Bot@LM: I DON'T KNOW.
LM@Bot: Where do you put eyes on your bitter sack then walsh?
Bot@LM: I don't.
LM@Bot: Did you and Rosemary know each other?
Bot@LM: Yes.
LM@Bot: Can I be of some assistance?
Bot@LM: no, I'm fine.
LM@Bot: who's slow?
Bot@LM: I'm slow.
LM@Bot: what the hell did dave want?
Bot@LM: I don't know.
LM@Bot: Did she say I needed money?
Bot@LM: yeah.
LM@Bot: what the hell have you been up to?
Bot@LM: Nothing.
LM@Bot: Hey, anybody home?
Bot@LM: Hey.
LM@Bot: Where have Magi lived?
Bot@LM: I'll tell you.