Language Model scorer and sampler

Hello all,

I have committed tools for language models - you can now use lm.lua to score and sample a corpus with language models. It supports text with features.

To try it:

  • train a language model:
th preprocess.lua -data_type monotext -train train.tok -valid valid.tok -save_data datalm

If you use features, set -time_shift_feature false.

  • train with -model_type lm

  • score with a language model:

th lm.lua score -model model.t7 -src t

This will generate output.txt with the perplexity of each sentence.

  • sample with a language model:
th lm.lua sample -model model.t7 -src t [-temperature T] [-max_seq_len L]

where t contains sentence prefixes (seeds) and may be empty. Random sentences of up to L words will be generated. The parameter T, between 0 and 1, governs how random the generation is: the closer to zero, the more consistent and less random the output.
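To illustrate what T does: temperature divides the logits before the softmax, so a small T sharpens the distribution toward the most likely token. A minimal Python sketch of the idea (my own illustration, not the actual Lua implementation):

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample an index from unnormalized logits after temperature scaling.

    Dividing logits by a small temperature sharpens the softmax, so
    generation approaches greedy (argmax) as temperature goes to 0.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random() * total
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e
        if r < acc:
            return i
    return len(exps) - 1
```

With T = 0.1, the highest-logit token is chosen almost every time; with T = 1.0, the lower-scoring tokens are sampled noticeably more often.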

Any feedback is welcome, and as a teaser, I trained an LM on the corpus from:

and the LM can generate random questions that the chatbot will answer :smile:! Just for fun, a sample:

LM@Bot: Where is my love?
Bot@LM: Tell me.
LM@Bot: Has Ken come back yet?
Bot@LM: no, not yet
LM@Bot: Where do you put eyes on your bitter sack then walsh?
Bot@LM: I don't.
LM@Bot: Did you and Rosemary know each other?
Bot@LM: Yes.
LM@Bot: Can I be of some assistance?
Bot@LM: no, I'm fine.
LM@Bot: who's slow?
Bot@LM: I'm slow.
LM@Bot: what the hell did dave want?
Bot@LM: I don't know.
LM@Bot: Did she say I needed money?
Bot@LM: yeah.
LM@Bot: what the hell have you been up to?
Bot@LM: Nothing.
LM@Bot: Hey, anybody home?
Bot@LM: Hey.
LM@Bot: Where have Magi lived?
Bot@LM: I'll tell you.

It would be great to add importance sampling or NCE, because with a very large vocabulary, training is rather infeasible right now on a large corpus; see for instance the Google Billion Word corpus.
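For reference, NCE sidesteps the full-vocabulary softmax by training a binary classifier that separates the observed token from k noise samples, so only k+1 scores are needed per position instead of one per vocabulary word. A rough sketch of the per-token objective (my own illustration with hypothetical names, assuming a known noise probability; not code from OpenNMT):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nce_loss(target_score, noise_scores, noise_prob):
    """NCE loss for one position: the observed token vs. k noise samples.

    target_score / noise_scores are unnormalized model scores s(w);
    noise_prob is q(w) under the noise distribution. The probability
    that w came from the data is sigmoid(s(w) - log(k * q(w))).
    """
    k = len(noise_scores)
    offset = math.log(k * noise_prob)
    loss = -math.log(sigmoid(target_score - offset))
    for s in noise_scores:
        loss += -math.log(1.0 - sigmoid(s - offset))
    return loss
```

Raising the model's score for the observed token lowers the loss, and the gradient never touches the full vocabulary.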

Great add-on, though.


Importance Sampling available in - validation and performance numbers welcome!


Great, will do.
However, on a smaller corpus (where sampling is less critical), given my first results I have the feeling that we may need variational dropout as opposed to the regular one.

PPL decreases quite quickly but then re-increases quite significantly. Tested on PTB.
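For context, variational (a.k.a. locked) dropout samples one dropout mask per sequence and reuses it at every timestep, instead of drawing a fresh mask at each step; that is what tends to regularize recurrent LMs better on small corpora like PTB. A Python sketch of the difference (my own illustration):

```python
import random

def dropout_mask(size, p, rng):
    """Inverted-dropout mask: zeros with probability p, else scaled by 1/(1-p)."""
    return [0.0 if rng.random() < p else 1.0 / (1.0 - p) for _ in range(size)]

def variational_dropout(sequence, p, rng):
    """Apply ONE mask to every timestep of a sequence of hidden vectors.

    Regular dropout would call dropout_mask() once per timestep instead,
    which lets the RNN average the noise away over time.
    """
    if not sequence:
        return []
    mask = dropout_mask(len(sequence[0]), p, rng)
    return [[h * m for h, m in zip(step, mask)] for step in sequence]
```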


Quick question: by default, is the attention layer active in the lm model type?

And subsequently, can we activate/deactivate attention the same way as for translation?

Hi Vincent, no - there is no attention layer in the LM (or the seqtagger).

The reason is that the attention model is part of the decoder. It could be possible, though, to introduce a variant of attention between the encoder and the generator.

Hi Jean,

Is there any min_seq_len option to generate sentences of at least a minimum specific length?

Thank you!

Hello @Crista23 - no, there is none, but this can be added easily: we just need to ignore ‘’ token generation for at least that number of steps. Let me know if you want to have a try and submit a PR, or I can do it.
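The idea is just to mask the end-of-sentence logit while the generated length is below the minimum, roughly like this (a sketch; min_seq_len and eos_id are hypothetical names, this is not the lm.lua code):

```python
def mask_eos(logits, step, min_seq_len, eos_id):
    """Forbid the end-of-sentence token before min_seq_len steps have been
    generated, by setting its logit to -inf so the sampler can never pick it."""
    if step < min_seq_len:
        logits = list(logits)  # copy: do not mutate the caller's list
        logits[eos_id] = float("-inf")
    return logits
```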

Is there a way to quickly get scores from the LM for partial hypotheses? I would like to use the LM in an NLG system to prune generation hypotheses.

Hello - yes, you can use th lm.lua -mode score, and it should work as-is.

Thanks. I’ve noticed that we do not need to put -mode, as this is not an option of lm.lua.
Is there a way to score input coming from STDIN?

Thanks for the report on -mode; there is indeed a discrepancy between the docs and the CLI, which I will fix. I just added a patch so that ‘-’ stands for STDIN, to be used with lm.lua (or translate, tag).
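The ‘-’ convention is the usual Unix one: treat the path ‘-’ as standard input. In Python terms the dispatch looks roughly like this (an illustration, not the actual Lua patch):

```python
import sys

def open_source(path):
    """Return a readable stream: STDIN if path is '-', otherwise the file."""
    if path == "-":
        return sys.stdin
    return open(path, "r")
```

This is what lets sentences be piped in, e.g. echo 'some sentence' | th lm.lua score … -src -.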

Thanks, that works nicely.
But when I run
th lm.lua score -model /home/knox/nlm/dutchnlm_epoch3_44.03_release.t7 -src testje.txt
several times, I get different scores each time.
This also happens when I use STDIN as the input method:

[vincent@suske /home/suske/openNMT/OpenNMT]$ echo 'de honden blaften' | th lm.lua score -model /home/knox/nlm/dutchnlm_epoch3_44.03_release.t7 -src -
[11/28/17 13:55:49 INFO] Loading '/home/knox/nlm/dutchnlm_epoch3_44.03_release.t7'...	
[11/28/17 13:55:49 INFO] SENT 1: 4.9632294178009	
[vincent@suske /home/suske/openNMT/OpenNMT]$ echo 'de honden blaften' | th lm.lua score -model /home/knox/nlm/dutchnlm_epoch3_44.03_release.t7 -src -
[11/28/17 13:55:51 INFO] Loading '/home/knox/nlm/dutchnlm_epoch3_44.03_release.t7'...	
[11/28/17 13:55:51 INFO] SENT 1: 5.0450329780579	

Any idea why?

No, it should not - something is probably not correctly initialised. Please open an issue on GitHub and I will have a look.

OK, I’ve opened an issue on GitHub.

How did you manage to create the models? I can’t make training work at all. I use

cd /OpenNMT
for f in "${src_train}" "${src_val}"; do th tools/tokenize.lua -segment_numbers < "${f}" > "${f}.tok";done
th preprocess.lua -data_type monotext -train "${src_train}.tok" -valid "${src_val}.tok" -save_data "${fldr}/models/${prefix}"
th train.lua -model_type lm -data ${fldr}/models/${prefix}-train.t7 -save_model ${fldr}/models/${prefix} -gpuid 1

and I get

[05/18/18 12:49:32 INFO] Preallocating memory
/torch/install/bin/luajit: ./onmt/train/Trainer.lua:156: attempt to get length of field 'targetInputFeatures' (a nil value)
stack traceback:
        ./onmt/train/Trainer.lua:156: in function '__init'
        /torch/install/share/lua/5.1/torch/init.lua:91: in function 'new'
        train.lua:332: in function 'main'
        train.lua:338: in main chunk
        [C]: in function 'dofile'
        /torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

I am using the latest ONMT code (just ran git pull origin master when inside the ONMT folder).

Hi Wiktor, I opened an issue here:



Hi @jean.senellart,

I’ve been trying to understand the scores output by lm.lua, and it appears that the score output is actually the loss of the model, which is the log of the perplexity. Am I missing something, or is it indeed the log of the perplexity?
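For concreteness, here is the relation I mean (a sketch of the usual convention, where the loss is the mean negative log-likelihood per token in nats):

```python
import math

def loss_from_token_logprobs(token_logprobs):
    """Mean negative log-likelihood over the tokens of one sentence."""
    return -sum(token_logprobs) / len(token_logprobs)

def perplexity_from_loss(loss):
    """If the score is the mean negative log-likelihood, the conventional
    perplexity is simply its exponential."""
    return math.exp(loss)
```

If so, exp(score) would give the usual per-sentence perplexity.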


Hello! I can see this issue is closed, but I’m still having the same problem when training the LM:

/root/torch/bin/luajit: ./onmt/train/Trainer.lua:156: attempt to get length of field 'targetInputFeatures' (a nil value)
stack traceback:
        ./onmt/train/Trainer.lua:156: in function '__init'
        /root/torch/share/lua/5.1/torch/init.lua:91: in function 'new'
        train.lua:332: in function 'main'
        train.lua:338: in main chunk
        [C]: in function 'dofile'
        /root/torch/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00405d50

I’m trying to train with this command

sudo nvidia-docker run -v /home/claudia/sms_old_data_openmt/:/home/data -d opennmt/opennmt:latest th train.lua -gpuid 1 -model_type lm -data /home/data/subs-lm-train.t7 -save_model /home/data/lm-model

Can you please let me know how to solve the error if it is solved?


Please note that language model training, scoring, and sampling are all implemented in OpenNMT-tf (but not LM fusion with seq2seq). You will get better support there, as OpenNMT-lua is now deprecated.