End-to-End Speech Recognition using PDBRNN in OpenNMT

This tutorial describes how to perform end-to-end English speech recognition in OpenNMT using a pyramidal deep bidirectional RNN (PDBRNN) encoder.

1. Data preparation

OpenNMT requires a source and a target for training. For speech recognition, the source is a sequence of acoustic feature vectors and the target is a sequence of characters (i.e. the transcription). Here we make use of the Wall Street Journal (WSJ) speech corpus, which provides about 70 hours of 16 kHz English speech. The following GitHub repository helps to build the training, dev, and test sets of sources and targets from the WSJ corpus using the Kaldi speech recognition toolkit.
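
With the -idx_files option used below, each line of the source and target files begins with an utterance key so that the two sides can be aligned. The target side spells the transcription out character by character, with '_' marking word boundaries. A hypothetical target line (the key and sentence are invented for illustration) might look like:

11 t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r

The source side holds the corresponding 120-dimensional filterbank frames under the same keys; see the repository above for the exact layout.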

2. Lua 5.2 installation

A full sequence of acoustic features is too heavy to load completely under LuaJIT's memory limit; Lua 5.2 lets OpenNMT load such large data. LuaJIT users will instead need to limit the length of the source and target sequences. You can install Torch with Lua 5.2 as shown in the following.

git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch
TORCH_LUA_VERSION=LUA52 ./install.sh
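
To confirm the build, you can print the Lua version from the Torch REPL; it should report Lua 5.2 (a LuaJIT build reports Lua 5.1):

~/torch/install/bin/th -e "print(_VERSION)"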

3. Training

th preprocess.lua -data_type 'feattext' \
  -train_src wsj_train_fbank_fbank120.txt \
  -train_tgt wsj_train_fbank_trans_nm.txt \
  -valid_src wsj_dev_fbank_fbank120.txt \
  -valid_tgt wsj_dev_fbank_trans_nm.txt \
  -save_data wsj_kaldi_nm_fb120 \
  -idx_files -src_seq_length 2450 -tgt_seq_length 330
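
Here -data_type 'feattext' tells the preprocessor that the source side contains real-valued feature frames rather than tokens, -idx_files enables the keyed file format described in section 1, and the -src_seq_length/-tgt_seq_length limits drop utterances whose frame or character sequences exceed 2450 and 330 respectively. Training then runs on the generated .t7 package: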

NAME=asr_wsj_pdbrnn_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm
CUDA_VISIBLE_DEVICES=0 \
  ~/torch/install/bin/th train.lua \
  -data wsj_kaldi_nm_fb120-train.t7 \
  -save_model $NAME -encoder_type pdbrnn -report_every 1 \
  -rnn_size 256 -word_vec_size 50 -enc_layers 3 -dec_layers 3 -max_batch_size 6 \
  -learning_rate 0.2 -dropout_type naive -dropout 0 \
  -learning_rate_decay 1 -start_decay_at 10 -end_epoch 30 -pdbrnn_merge concat \
  -residual false -gpuid 1
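
With -encoder_type pdbrnn, each encoder layer above the first merges pairs of consecutive hidden states (by concatenation here, via -pdbrnn_merge concat), halving the time resolution at every pyramidal layer. This is what keeps attention tractable over thousands of input frames.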

OpenNMT supports a training feature called scheduled sampling, which can be configured as follows. You can set the parameter values to your preference; here, I use an inverse sigmoid decay applied at the token level, following the paper below.

https://papers.nips.cc/paper/5956-scheduled-sampling-for-sequence-prediction-with-recurrent-neural-networks.pdf
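
For reference, the paper defines inverse sigmoid decay as epsilon_i = k / (k + exp(i / k)), where epsilon_i is the probability of feeding the gold token at step i and k >= 1 controls how quickly sampling takes over. How -scheduled_sampling 0.99 and -scheduled_sampling_decay_rate 7.5 map onto this schedule is OpenNMT-specific, so check the train.lua options for the exact semantics.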

NAME=asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm
CUDA_VISIBLE_DEVICES=0 \
  ~/torch/install/bin/th train.lua \
  -data wsj_kaldi_nm_fb120-train.t7 \
  -save_model $NAME -encoder_type pdbrnn -report_every 1 \
  -rnn_size 256 -word_vec_size 50 -enc_layers 3 -dec_layers 3 -max_batch_size 6 \
  -learning_rate 0.2 -dropout_type naive -dropout 0 \
  -learning_rate_decay 1 -start_decay_at 10 -end_epoch 25 -pdbrnn_merge concat \
  -scheduled_sampling 0.99 -scheduled_sampling_decay_rate 7.5 \
  -scheduled_sampling_scope token -scheduled_sampling_decay_type invsigmoid \
  -residual false -gpuid 1

[image: SSC]

4. Prediction

NAME=asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm
src_file=wsj_test_fbank_fbank120.txt
tgt_file=wsj_test_fbank_trans_nm.txt
wer_file=$RESULT/WER_$NAME.txt
gold_file=$(mktemp /tmp/gold.XXXXXX); pred_file=$(mktemp /tmp/pred.XXXXXX)

[ -e $wer_file ] && rm -f $wer_file
# Decode the test set with the checkpoint of every epoch and score each one.
for ((epoch_cn=1; epoch_cn<=25; epoch_cn++))
do
  model=$(ls $NAME"_"epoch$epoch_cn"_"*)
  output=$NAME"_"epoch$epoch_cn.out
  full_output=$NAME"_"epoch$epoch_cn.out.log
  echo $model >> $wer_file

  CUDA_VISIBLE_DEVICES=0 \
  ~/torch/install/bin/th translate.lua \
  -src $src_file -tgt $tgt_file -model $model \
  -output $output -batch_size 1 -gpuid 1 -idx_files true > $full_output

  # Convert the character-level log lines back into plain sentences (see example below).
  grep GOLD $full_output | grep -v SCORE | perl -pe 's/.*: //g;s/ //g;s/_/ /g' > $gold_file
  grep PRED $full_output | grep -v SCORE | perl -pe 's/.*: //g;s/ //g;s/_/ /g' > $pred_file

  python wer++.py $pred_file $gold_file >> $wer_file
done
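
The perl pipeline strips everything up to the colon, removes the spaces between characters, and rewrites '_' as a real space. For example, a log line such as

[09/15/17 22:51:48 INFO] GOLD 11: t h e _ w a r m i n g _ t r e n d

becomes

the warming trend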

The WER is computed with wer++.py. As training progresses, the prediction for the 11th test utterance below improves.

>> grep -w "GOLD 11" *_nm_epoch1.out.log
[09/15/17 22:51:48 INFO] GOLD 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s

>> grep -w "PRED 11" *_nm_epoch*
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch1.out.log:
[09/15/17 22:51:48 INFO] PRED 11: t h e _ w a r l i n g _ t r e n d _ m a y _ h a v e _ m o t e d _ t h e _ s k i l l _ c o v e r o n _ s o m e _ c r o b e s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch2.out.log:
[09/15/17 23:04:01 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch3.out.log:
[09/15/17 23:15:28 INFO] PRED 11: t h e _ w o r m i n g _ t r e n d _ m a y _ h a v e _ m o t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch4.out.log:
[09/15/17 23:26:56 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch5.out.log:
[09/16/17 22:23:14 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch6.out.log:
[09/16/17 22:34:29 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m o t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch7.out.log:
[09/16/17 22:45:56 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch8.out.log:
[09/16/17 22:57:26 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l i t i d _ t h e _ s n o c k o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch9.out.log:
[09/17/17 21:42:58 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e a l t e d _ t h e _ s n o w _ c o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch10.out.log:
[09/17/17 21:54:22 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s k n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch11.out.log:
[09/17/17 22:05:50 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h i s _ n o _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch12.out.log:
[09/17/17 22:17:08 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h i s _ n o _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch13.out.log:
[09/18/17 11:20:39 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch14.out.log:
[09/18/17 11:31:49 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch15.out.log:
[09/19/17 10:04:05 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch16.out.log:
[09/19/17 10:15:24 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch17.out.log:
[09/19/17 10:27:02 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch18.out.log:
[09/19/17 10:38:35 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch19.out.log:
[09/19/17 15:05:39 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l v e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch20.out.log:
[09/19/17 22:57:40 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l v e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch21.out.log:
[09/25/17 11:03:35 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch22.out.log:
[09/25/17 11:15:53 INFO] PRED 11: t h e _ w o r m i n g _ t r e n d _ m a y _ h a v e _ m e l v e d _ t h i s _ n o _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch23.out.log:
[09/25/17 11:31:32 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch24.out.log:
[09/25/17 11:43:13 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e a l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s
asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm_epoch25.out.log:
[09/25/17 11:54:25 INFO] PRED 11: t h e _ w a r m i n g _ t r e n d _ m a y _ h a v e _ m e l t e d _ t h e _ s n o w _ c o v e r _ o n _ s o m e _ c r o p s

[image: WER]

5. LM Shallow Fusion

OpenNMT supports a decoding feature called LM shallow fusion, which improves accuracy further. The language model first needs to be trained with the dictionary produced by the data preparation above (note -vocab below), and it can be configured as follows. Here we made use of an in-house English corpus of 2M and 20M TUs.
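
At decoding time, shallow fusion interpolates the decoder's log-probability for each candidate character with the language model's, roughly score(y) = log p_asr(y|x) + lm_weight * log p_lm(y); the -lm_weight swept in section 5.2 is this interpolation weight.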

5.1 LM training

~/torch/install/bin/th preprocess.lua \
  -data_type monotext \
  -train CI_news-uniq.en.tok \
  -seq_length 700 \
  -valid CI_news-dev.en.tok \
  -save_data CI_news-uniq.en.tok \
  -vocab wsj_kaldi_fb120.tgt.dict

CUDA_VISIBLE_DEVICES=0 \
~/torch/install/bin/th train.lua \
  -data CI_news-uniq.en.tok-train.t7 \
  -save_model CI_news-uniq.en.tok_lm_r1024_wv500_l2 \
  -model_type lm -report_every 1 -max_batch_size 300 -end_epoch 35 \
  -rnn_size 1024 -word_vec_size 500 -layers 2 \
  -learning_rate 0.1 -learning_rate_decay 1.0 \
  -gpuid 1
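
OpenNMT writes one checkpoint per epoch with the validation score appended to the filename; the epoch-31 checkpoint (validation score 2.26) is the one passed to -lm_model in the decoding script below.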

5.2 Decoding with Shallow Fusion

NAME=asr_wsj_pdbrnn_0.99_7.5_token_invsigmoid_fb120_pdbrnn_E3_D3_rnn256_sgd0.2_naive0_nm
src_file=wsj_test_fbank_fbank120.txt
tgt_file=wsj_test_fbank_trans_nm.txt
wer_file=WER_$NAME.txt
gold_file=$(mktemp /tmp/gold.XXXXXX); pred_file=$(mktemp /tmp/pred.XXXXXX)

[ -e $wer_file ] && rm -f $wer_file
# Decode with the final checkpoint only, sweeping the LM weight.
for epoch_cn in 25
do
  model=$(ls $NAME"_"epoch$epoch_cn"_"*)
  lw=0.000; incl=0.003
  for iter in {1..80..1}
  do
    lw=`echo $lw + $incl | bc`
    output=$NAME"_"epoch$epoch_cn.out
    full_output=$NAME"_"epoch$epoch_cn.out.log
    echo $model"_"$lw >> $wer_file

    CUDA_VISIBLE_DEVICES=0 \
    ~/torch/install/bin/th translate.lua \
    -src $src_file -tgt $tgt_file -model $model \
    -lm_model CI_news-uniq.en.tok_lm_r1024_wv500_l2_epoch31_2.26.t7 \
    -lm_weight $lw -output $output -batch_size 1 -gpuid 1 -idx_files true > $full_output

    # Same cleanup as in section 4: strip the log prefix and restore word spacing.
    grep GOLD $full_output | grep -v SCORE | perl -pe 's/.*: //g;s/ //g;s/_/ /g' > $gold_file
    grep PRED $full_output | grep -v SCORE | perl -pe 's/.*: //g;s/ //g;s/_/ /g' > $pred_file

    python wer++.py $pred_file $gold_file >> $wer_file
  done
done
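
The inner loop sweeps -lm_weight from 0.003 to 0.240 in steps of 0.003; the weight yielding the lowest WER in $wer_file is the one to keep.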

[image: WER_LM]

Thank you so much for this contribution, Homin!

I have done the feature extraction of the WSJ corpus and put the files into my OpenNMT folder. I have also run the following command after the Torch install, as you kindly provided:
$ TORCH_LUA_VERSION=LUA52 ./install.sh

However, when I run preprocess.lua, I get the following error message saying that the onmt files cannot be found under my Torch installation:
no field package.preload['onmt.init']
no file '/home/kathy/.luarocks/share/lua/5.2/onmt/init.lua'
no file '/home/kathy/.luarocks/share/lua/5.2/onmt/init/init.lua'
no file '/home/kathy/torch/install/share/lua/5.2/onmt/init.lua'
no file '/home/kathy/torch/install/share/lua/5.2/onmt/init/init.lua'
no file '/home/kathy/.luarocks/share/lua/5.1/onmt/init.lua'
no file '/home/kathy/.luarocks/share/lua/5.1/onmt/init/init.lua'
no file '/home/kathy/torch/install/share/lua/5.1/onmt/init.lua'
no file '/home/kathy/torch/install/share/lua/5.1/onmt/init/init.lua'
no file './onmt/init.lua'

How can I install these dependent files into my Torch installation? I am planning to teach a course next Monday using this, and I would be very thankful if you could help me resolve the issue.

Many thanks Homin!

Hi Kathy,

Glad to see you in OpenNMT Forum.

I wonder which OS you are working on, because I have never seen such error messages on Linux distributions such as Ubuntu and CentOS. How about LuaJIT? Can you install it? You can find the installation information at the following link.

Thank you for the quick feedback. It was a path issue: after moving my files (the extracted features) to the OpenNMT directory, I no longer get this error message. :)