Monophone speech recognition with OpenNMT

seqtagger
inputvector

#1

This tutorial describes how to perform monophone speech recognition with the OpenNMT sequence tagger.

#1. Data preparation
OpenNMT needs a source and a target for training. For monophone speech recognition, the source is a sequence of acoustic features and the target is a sequence of monophone labels. Here we use the TIMIT corpus, in which monophones are annotated with timestamps in the audio file. The following GitHub repository helps build the source and target using the Kaldi speech recognition toolkit.
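
As a rough illustration of how timestamped annotations can become a frame-level target, the sketch below converts TIMIT-style segments (start sample, end sample, phone) into one label per frame. The 10 ms frame shift at TIMIT's 16 kHz sampling rate, the function name, and the rule of labelling each frame by the sample at its start are assumptions for illustration, not necessarily what the Kaldi recipe does.

```python
def segments_to_frame_labels(segments, num_frames, frame_shift=160):
    """Assign one phone label per frame from (start, end, phone) segments.

    segments: list of (start_sample, end_sample, phone), sorted by time.
    frame_shift: samples between frame starts (160 = 10 ms at 16 kHz, assumed).
    """
    labels = []
    seg_idx = 0
    for frame in range(num_frames):
        position = frame * frame_shift  # sample where this frame starts
        # Advance to the segment covering this position; stay on the last
        # segment if the frames run past the final annotation.
        while seg_idx < len(segments) - 1 and position >= segments[seg_idx][1]:
            seg_idx += 1
        labels.append(segments[seg_idx][2])
    return labels


# Made-up segments, not taken from the corpus.
segments = [(0, 320, "h#"), (320, 800, "sh"), (800, 1440, "iy")]
frame_labels = segments_to_frame_labels(segments, num_frames=9)
print(" ".join(frame_labels))
```

The target produced this way has exactly as many tokens as the source has frames, which is what a sequence tagger expects.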

#2. Lua 5.2 installation
A full sequence of acoustic features is large and heavy to load in its entirety. Lua 5.2 helps OpenNMT load such large data; LuaJIT users will need to limit the length of the source and target sequences. You can install Lua 5.2 as follows.

git clone https://github.com/torch/distro.git ~/torch --recursive
cd ~/torch
TORCH_LUA_VERSION=LUA52 ./install.sh

#3. Running OpenNMT

# Preprocess the training and validation data (feature vectors as source)
th preprocess.lua -data_type 'feattext' \
                  -train_src train.fbank120.clean \
                  -train_tgt train.tgt.clean \
                  -valid_src dev.fbank120.clean \
                  -valid_tgt dev.tgt.clean \
                  -save_data TIMIT_clean_800 -check_plength false \
                  -idx_files -src_seq_length 800 -tgt_seq_length 800

# Train a 3-layer bidirectional sequence tagger
CUDA_VISIBLE_DEVICES=0 \
th train.lua -data TIMIT_clean_800-train.t7 \
             -save_model asr_timit_clean_800_seqtagger_brnn_sgd029 \
             -brnn -report_every 1 -rnn_size 512 -word_vec_size 65 \
             -model_type seqtagger -layers 3 -max_batch_size 8 \
             -learning_rate 0.29 -dropout 0 -learning_rate_decay 1 \
             -end_epoch 30 -gpuid 1

# Tag the test set with a saved checkpoint
CUDA_VISIBLE_DEVICES=0 \
th tag.lua -src test.fbank120.clean \
           -model asr_timit_clean_800_seqtagger_dbrnn_sgd029_epoch10_2.74.t7 \
           -idx_files true -batch_size 1 -gpuid 1

/home/kwon/TIMIT3/asr_timit_clean_800_seqtagger_brnn_sgd029_epoch10_2.70.t7'...
[06/14/17 21:59:44 INFO] FEATS 1: IDX - mjdh0_si1984_dr6 - SIZE 131
[06/14/17 21:59:44 INFO] PRED 1: h# h# h# h# h# h# h# h# h# h# h# h# h# q dh ih ih ih ih m m m w w w w w w w ah ah ah ah ah ah ah dx dx dx ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay s s z z z s s s th th dh dh dh dh ey ey ey ey ey ey ey w w w w w w w w w w w w w w ao er er er er er er er er er er er axr axr axr axr axr h# h# h# h# h# h# h# h# h# h# h# h# h# h# h# h# h# h#
[06/14/17 21:59:44 INFO]
[06/14/17 21:59:44 INFO] FEATS 2: IDX - mjdh0_si1984_dr6 - SIZE 145
[06/14/17 21:59:44 INFO] PRED 2: h# h# h# h# h# h# h# h# h# h# h# y y y y y y y hv uw uw uw ow ow ow ow ow ow ow ow ow ow ow q q q q q q q q iy iy iy iy iy iy iy iy iy iy dcl dx dx ih ih ih ih ix ix ix n n n n n ah ah ah ah ah ah ah ah ah ah ah ah ah ah f f f f f f f f f f f pau pau pau pau pau hh hh hh hh hh hh hh hh hh hh hh ah ah ah ah ah ah ah ah ah n n nx n iy iy iy iy iy iy iy iy iy iy h# h# h# h# h# h# h# h# h# h# h# h# h# h# h#
[06/14/17 21:59:44 INFO]
[06/14/17 21:59:44 INFO] FEATS 3: IDX - mjdh0_si1984_dr6 - SIZE 158
[06/14/17 21:59:44 INFO] PRED 3: h# h# h# h# h# h# h# h# h# h# h# h# h# sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh iy iy iy iy iy iy iy iy iy iy iy iy ih ih ih ih ih s s s s s s s s s s tcl tcl tcl t t t t t ih eh eh eh eh eh eh eh nx n n er er axr axr axr axr axr axr er er er er er dh dh dh dh dh ix ix ix ix ix ix ix ix n n n n n ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ay ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ux ux m h# h# h# h# h# h# h# h# h# h# h# h#
[06/14/17 21:59:44 INFO]
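
To work with these predictions programmatically, the log lines can be parsed back into per-utterance label sequences. The sketch below is based only on the line layout visible in the output above; the regular expressions, the function name, and the sample utterance id `utt_a` are assumptions made for illustration.

```python
import re

# Patterns inferred from the log layout shown above (assumed, not an official format).
FEATS_RE = re.compile(r"FEATS (\d+): IDX - (\S+) - SIZE (\d+)")
PRED_RE = re.compile(r"PRED (\d+): (.*)")


def parse_tag_log(lines):
    """Return a list of (utterance_id, size, predicted_labels) tuples."""
    feats = {}
    results = []
    for line in lines:
        match = FEATS_RE.search(line)
        if match:
            # Remember the utterance id and frame count for this item number.
            feats[match.group(1)] = (match.group(2), int(match.group(3)))
            continue
        match = PRED_RE.search(line)
        if match and match.group(1) in feats:
            utt_id, size = feats[match.group(1)]
            results.append((utt_id, size, match.group(2).split()))
    return results


# Made-up log lines in the same layout as the real output.
log = [
    "[06/14/17 21:59:44 INFO] FEATS 1: IDX - utt_a - SIZE 5",
    "[06/14/17 21:59:44 INFO] PRED 1: h# h# sh sh iy",
    "[06/14/17 21:59:44 INFO]",
]
for utt_id, size, labels in parse_tag_log(log):
    # A sequence tagger is expected to emit one label per input frame.
    print(utt_id, size, len(labels) == size)
```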

#4. Results

To calculate the PER (phone error rate), you need to take phone grouping into account: for TIMIT, the 61 annotated phones are conventionally folded into 39 classes before scoring. You can also try different options and parameters for training.
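
One possible way to score the frame-level output is sketched below: fold the labels into groups, merge runs of identical frames into single phones, and divide the edit distance by the reference length. The `FOLD` table is only a small subset of the conventional TIMIT folding, and folding before collapsing is an assumption. Collapsing also merges genuinely repeated phones, so treat this as an approximation rather than the scoring used for any reported result.

```python
# Partial, illustrative subset of the conventional TIMIT folding; None means drop.
FOLD = {"h#": "sil", "pau": "sil", "tcl": "sil", "dcl": "sil",
        "ix": "ih", "axr": "er", "ux": "uw", "hv": "hh", "nx": "n", "q": None}


def fold_phones(labels):
    """Map each label to its group and drop labels mapped to None."""
    folded = [FOLD.get(label, label) for label in labels]
    return [label for label in folded if label is not None]


def collapse_frames(frame_labels):
    """Merge runs of identical consecutive frame labels into single phones."""
    phones = []
    for label in frame_labels:
        if not phones or phones[-1] != label:
            phones.append(label)
    return phones


def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phone_error_rate(ref_frames, hyp_frames):
    """PER = edit distance / reference length (reference must be non-empty)."""
    ref = collapse_frames(fold_phones(ref_frames))
    hyp = collapse_frames(fold_phones(hyp_frames))
    return edit_distance(ref, hyp) / len(ref)


# Made-up frame sequences, not taken from the corpus.
ref_frames = "h# h# sh sh iy iy ix n h#".split()
hyp_frames = "h# sh sh sh iy ih ih m h# h#".split()
print(f"PER = {phone_error_rate(ref_frames, hyp_frames):.3f}")
```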


(Vincent Nguyen) #2

Hello @homink,

nice to see this.
A few questions though:

  • you seem to take seq lengths of 800 and 800.
    In my understanding, there should be an average of 8 frames per monophone, which means 800 src frames should correspond to a target seq length of about 100, right?

  • you seem to have used brnn and not pdbrnn. I did not get good results with pdbrnn either, even though some papers mention this technique. Did you test it?

Cheers,
Vincent


(jean.senellart) #3

Hi @vince62s, for sequence tagging, we cannot use pdbrnn since it reduces the time-dimensionality.


#4

Hi @vince62s,

Thanks for your comments. To answer your question, I resized the monophone sequence (A B C to A A B B B C C C C) so that it exactly matches the length of the acoustic feature frames. The source and target therefore always have the same length. If you read the GitHub repository, you will see how I prepare the two sets of training data. The maximum length of the acoustic features is about 770 or 780 frames, so I set 800 to load everything.
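
A quick sanity check of this setup might look like the sketch below, which verifies that each utterance has as many target labels as feature frames and flags anything longer than the chosen limit. The function name, the dictionary inputs, and the utterance ids are hypothetical; only the 800 limit comes from the commands in the tutorial.

```python
def check_lengths(frame_counts, targets, max_len=800):
    """Compare source and target lengths per utterance.

    frame_counts: {utt_id: number of feature frames}
    targets: {utt_id: list of frame-level labels}
    Returns (mismatched, too_long) lists of utterance ids.
    """
    mismatched = [utt for utt, n in frame_counts.items()
                  if len(targets.get(utt, [])) != n]
    too_long = [utt for utt, n in frame_counts.items() if n > max_len]
    return mismatched, too_long


# Made-up data: utt_a is correctly expanded, utt_b still has one label per phone.
frame_counts = {"utt_a": 9, "utt_b": 4}
targets = {"utt_a": "A A B B B C C C C".split(), "utt_b": "A B C".split()}
mismatched, too_long = check_lengths(frame_counts, targets)
print("mismatched:", mismatched, "too long:", too_long)
```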


(Menamine) #5

Hi @homink,
Thank you for the tutorial. I have a technical issue when preparing data with the script preprocess.lua: I got “Invalid line - missing idx”. I believe the problem is caused by the spaces before each feature vector. So my question is: did you adapt the feature file generated by Kaldi (change the format, for example), or did you use it directly to prepare your data?
Best,
Amine.


#6

Hi @menamine,

I used the feature data produced by Kaldi, which is available at https://github.com/homink/kaldi/tree/FeatureText/egs/timit/s5

Best,
Homin