Language Model scorer and sampler

jean.senellart · June 13, 2017, 8:25am

Hello all,

I have commited tools for language models - you can now use lm.lua to score and sample corpus with language models. It supports text with features.

To try it:

train a language model:

th preprocess.lua -data_type monotext -train train.tok -valid valid.tok -save_data datalm

if you use features, activate time_shift_feature false.

train with -model_type lm
score with a language model:

th lm.lua score -model model.t7 -src t

will generate output.txt with perplexity by sentence

sample with a language model:

th lm.lua sample -model model.t7 -src t [-temperature T] [-max_seq_len L]

where t contains prefix (seed) of sentences - can be empty. Random sentences up to L words will be generated. The parameter T between 0-1 governs how random should be the generation. The closest to zero, the more consistent and less random.

Any feedback is welcome, and as a teaser, I trained a LM on the corpus from:

and the LM can generate random questions that the chatbot will answer ! Just for the fun a sample:

LM@Bot: Where is my love?
Bot@LM: Tell me.
LM@Bot: Has Ken come back yet?
Bot@LM: no, not yet
LM@Bot: WHATS HITCHKOK EXACTLY?
Bot@LM: I DON'T KNOW.
LM@Bot: Where do you put eyes on your bitter sack then walsh?
Bot@LM: I don't.
LM@Bot: Did you and Rosemary know each other?
Bot@LM: Yes.
LM@Bot: Can I be of some assistance?
Bot@LM: no, I'm fine.
LM@Bot: who's slow?
Bot@LM: I'm slow.
LM@Bot: what the hell did dave want?
Bot@LM: I don't know.
LM@Bot: Did she say I needed money?
Bot@LM: yeah.
LM@Bot: what the hell have you been up to?
Bot@LM: Nothing.
LM@Bot: Hey, anybody home?
Bot@LM: Hey.
LM@Bot: Where have Magi lived?
Bot@LM: I'll tell you.

vince62s · June 13, 2017, 9:32pm

would be great to add Importance sampling or NCE, because for very large vocabulary, it is rather infeasible right now on large corpus. see for instance the Google billion word corpus.

great add-on though.

thanks!

jean.senellart · June 14, 2017, 1:02pm

Importance Sampling available in https://github.com/OpenNMT/OpenNMT/pull/318 - validation and performance numbers welcome!

vince62s · June 14, 2017, 1:19pm

Jean,
great will do.
However, on a smaller corpus (where sampling is less critical) given my first results, I have the feeling that we may need variational dropout as opposed to the regular one.

PPL decrease quite quickly but re-increase quite significantly. Tested on PTB.

vince62s · June 25, 2017, 10:36am

@jean.senellart,

Quick question, by default is the attention layer active in model type LM mode ?

subsequently can we activate / desactivate attention the same way as for translation ?

jean.senellart · June 25, 2017, 9:34pm

Hi Vincent, no - there is no attention layer in the LM (or the seqtagger).

The reason is that the attention model is part of the decoder, it could be possible to introduce a variant of attention though between encoder and generator.

Crista23 · August 27, 2017, 10:10pm

Hi Jean,

Is there any min_seq_len option to generate sentences of at least a minimum specific length?

Thank you!

jean.senellart · August 28, 2017, 7:02am

Hello @Crista23 - no there is none, but this can be added easily: we just need to ignore ‘’ token generation for at least this length number of steps. Let me know if you want to have a try and submit a PR or I can do it.

vincent · November 16, 2017, 10:16am

Is there a way to quickly get scores from the LM for partial hypotheses? I would like to use the LM in a NLG-system and use it for pruning generation hypotheses.

jean.senellart · November 20, 2017, 10:12pm

Hello - yes. you can use th lm.lua -mode score, and it should work as-is.

vincent · November 21, 2017, 11:07am

thanks, I’ve noticed that we do not need to put -mode, as this is not an option of lm.lua.
Is there a way to score input coming from STDIN?

jean.senellart · November 21, 2017, 1:25pm

thans for the report on -mode, there is actually a discrepancy between doc and cli, I will fix. I just added here a patch so that ‘-’ stands for STDIN - to be used for lm.lua (or translate, tag)

vincent · November 28, 2017, 12:56pm

Thanks, that works nicely.
But when I run
th lm.lua score -model /home/knox/nlm/dutchnlm_epoch3_44.03_release.t7 -src testje.txt
several times, I get different scores each time
This also happens when I use STDIN as input method

[vincent@suske /home/suske/openNMT/OpenNMT]$ echo 'de honden blaften' | th lm.lua score -model /home/knox/nlm/dutchnlm_epoch3_44.03_release.t7 -src -
[11/28/17 13:55:49 INFO] Loading '/home/knox/nlm/dutchnlm_epoch3_44.03_release.t7'...	
[11/28/17 13:55:49 INFO] SENT 1: 4.9632294178009	
[vincent@suske /home/suske/openNMT/OpenNMT]$ echo 'de honden blaften' | th lm.lua score -model /home/knox/nlm/dutchnlm_epoch3_44.03_release.t7 -src -
[11/28/17 13:55:51 INFO] Loading '/home/knox/nlm/dutchnlm_epoch3_44.03_release.t7'...	
[11/28/17 13:55:51 INFO] SENT 1: 5.0450329780579

Any idea why?

jean.senellart · November 28, 2017, 8:29pm

No it should not - something is probably not correctly initialised. please open a case on github and I will have a look.

vincent · November 29, 2017, 12:26pm

Ok, I’ve opened an issue on github

wiktor.stribizew · May 18, 2018, 12:53pm

How did you manage to create the models? I can’t make training work at all. I use

cd /OpenNMT
for f in "${src_train}" "${src_val}"; do th tools/tokenize.lua -segment_numbers < "${f}" > "${f}.tok";done
th preprocess.lua -data_type monotext -train "${src_train}.tok" -valid "${src_val}.tok" -save_data "${fldr}/models/${prefix}"
th train.lua -model_type lm -data ${fldr}/models/${prefix}-train.t7 -save_model ${fldr}/models/${prefix} -gpuid 1

and I get

[05/18/18 12:49:32 INFO] Preallocating memory
/torch/install/bin/luajit: ./onmt/train/Trainer.lua:156: attempt to get length of field 'targetInputFeatures' (a nil value)
stack traceback:
        ./onmt/train/Trainer.lua:156: in function '__init'
        /torch/install/share/lua/5.1/torch/init.lua:91: in function 'new'
        train.lua:332: in function 'main'
        train.lua:338: in main chunk
        [C]: in function 'dofile'
        /torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

I am using the latest ONMT code (just ran git pull origin master when inside the ONMT folder).

jean.senellart · May 23, 2018, 10:17am

Hi Wiktor, I opened an issue here:

Best
Jan

avinash_v · July 29, 2018, 8:04am

Hi @jean.senellart,

I’ve been trying to understand the scores output by lm.lua, and it appears that the score output is actually the loss of the model (https://github.com/OpenNMT/OpenNMT/blob/master/onmt/lm/LM.lua#L142), which is the log of the perplexity. Am I missing something, or is it indeed the log of the perplexity?

Thanks
Avinash

cmatosv · April 26, 2019, 9:42am

Hello! I can see this issue is close but I’m still having the same problem when training the LM

/root/torch/bin/luajit: ./onmt/train/Trainer.lua:156: attempt to get length of field 'targetInputFeatures' (a nil value)

stack traceback:

./onmt/train/Trainer.lua:156: in function '__init'

/root/torch/share/lua/5.1/torch/init.lua:91: in function 'new'

train.lua:332: in function 'main'

train.lua:338: in main chunk

[C]: in function 'dofile'

/root/torch/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk

[C]: at 0x00405d50

I’m trying to train with this command

sudo nvidia-docker run -v /home/claudia/sms_old_data_openmt/:/home/data -d opennmt/opennmt:latest th train.lua -gpuid 1 -model_type lm -data /home/data/subs-lm-train.t7 -save_model /home/data/lm-model

Can you please let me know how to solve the error if it is solved?
Regards
Claudia

guillaumekln · April 28, 2019, 8:36pm

Hi,

Please note that language model training, scoring, and sampling are all implemented in OpenNMT-tf (but not LM fusion with seq2seq). You will be get better support there as OpenNMT-lua is now deprecated.