Speech to text: using MFCC at the preprocessing level

amittiwari · October 21, 2019, 6:55am

How can I implement Mel-frequency cepstral coefficients (MFCC) in OpenNMT-py's preprocess.py?
Is it possible in OpenNMT-py?

francoishernandez · October 21, 2019, 7:17am

You can have a look around here.

amittiwari · October 21, 2019, 1:32pm

I looked into it, but I am not a pro at Python. Can you tell me whether OpenNMT uses the MFCC algorithm at the preprocessing level, or does it only consider the spectrogram as the final token?

amittiwari · October 23, 2019, 12:49pm

I tried this step, but got stuck at this error:
RuntimeError: input.size(-1) must be equal to input_size. Expected 201, got 12.
The spectrogram array shape is (201, x),
but the MFCC array shape is (12, x).
How can I resolve this issue?

francoishernandez · October 23, 2019, 12:51pm

I suspect this is because the features you add don’t match the dimensions that the model is expecting.
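
For context, the 201 in the error matches the number of frequency bins of a one-sided spectrogram whose FFT size equals the window length in samples. Below is a minimal sketch of that arithmetic. It assumes a 16 kHz sample rate (not stated in this thread) and a 0.025 s window (the -window_size value in the training command further down). The function name is only illustrative and is not part of OpenNMT-py.

```python
def spectrogram_bins(sample_rate: int, window_size: float) -> int:
    """Number of frequency bins of a one-sided spectrogram.

    Assumes the FFT size equals the window length in samples,
    so the one-sided spectrum has n_fft // 2 + 1 bins.
    """
    n_fft = round(sample_rate * window_size)  # window length in samples
    return n_fft // 2 + 1


# Assumed values: 16 kHz audio, 25 ms window.
print(spectrogram_bins(16000, 0.025))  # 201
```

A 12-coefficient MFCC matrix has 12 rows per frame instead of 201, which is consistent with "Expected 201, got 12": the encoder input size has to be set to the feature dimension actually produced at preprocessing time.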

amittiwari · October 23, 2019, 12:54pm

Can I change this?

amittiwari · October 23, 2019, 1:01pm

Can I change the dimension the model expects?

francoishernandez · October 23, 2019, 1:24pm

You surely can, but you’ll probably have to keep diving into the code and see for yourself. It will also probably depend on the model you’ll want to use.
The audio codepath was introduced by @da03 some time ago now, is not extensively worked on at the moment, and is only compatible with RNN models, IIRC.

By the way, when reporting errors, please post the full trace and the command line(s) that triggered it.

amittiwari · October 24, 2019, 9:39am

Got this error:
RuntimeError: input.size(-1) must be equal to input_size. Expected 201, got 12.
I changed the input size, and after that it runs fine.
Can you tell me which is better, MFCC or spectrogram?

francoishernandez · October 24, 2019, 10:01am

I think the consensus is leaning towards MFCC, but again, it may depend on your task, model, etc. You may search for some literature and do experiments accordingly.
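
To make the comparison concrete, here is a rough sketch of how MFCCs are usually derived from the same power spectrum that a spectrogram is built on: apply a mel-spaced triangular filterbank, take the log, then a DCT, and keep the first few coefficients. This is a simplified, standard-library-only illustration, not the code OpenNMT-py uses. The 16 kHz sample rate, 26 filters, 12 coefficients, and all function names are assumptions, and pre-emphasis and windowing are left out for brevity.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """One-sided power spectrum of a real frame via a naive DFT."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies, converted to FFT bin indices
    edges = [mel_to_hz(mel_max * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        row = [0.0] * n_bins
        for k in range(left, center):      # rising slope
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def dct2(values, n_out):
    """First n_out coefficients of an unnormalized DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n)
            for i, v in enumerate(values))
        for k in range(n_out)
    ]


def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=12):
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # small constant avoids log(0) for empty filters
    log_energies = [
        math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
        for row in bank
    ]
    return dct2(log_energies, n_coeffs)


# One 25 ms frame of a 440 Hz tone at an assumed 16 kHz sample rate.
sample_rate = 16000
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(400)]
print(len(power_spectrum(frame)))            # 201
print(len(mfcc_frame(frame, sample_rate)))   # 12
```

The sketch shows the trade-off being discussed: the spectrogram keeps all 201 bins per frame, while MFCC compresses them into a handful of decorrelated coefficients. Which one works better still has to be checked experimentally for a given model.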

amittiwari · October 24, 2019, 11:04am

I tried MFCC on OpenNMT-py.
I made some changes to the features in the encoder script audio_dataset.py.
Training is running and I am waiting for results.

amittiwari · October 25, 2019, 6:46am

Hi Franc,
I am working on the optimizer.
I am using SGD as the optimizer.
Do you have any idea how many training steps it will take?

francoishernandez · October 25, 2019, 7:21am

It will depend on your task / data / #parameters / configuration / hardware.
I did not extensively experiment with speech on OpenNMT-py so I wouldn’t be able to share any numbers.

But, based on the other post where you mentioned having 40GB of data, I’d say you may need at least several tens of thousands of steps, but don’t take my word for it.

If you get interesting experiments, feel free to share some config / results / description of your task so that it can help others!

amittiwari · October 25, 2019, 12:25pm

I need some help.
Can you tell me where I can find the loss function in OpenNMT-py?

amittiwari · November 19, 2019, 6:15am

Hi Franc,
I have done the preprocessing using MFCC, but model training gets stuck in the early steps.

python3 train.py -model_type audio -enc_rnn_size 2048 -dec_rnn_size 1024 -audio_enc_pooling 1,1,1,1,2,2,2,2 -dropout 0.1 -enc_layers 8 -dec_layers 6 -rnn_type LSTM -data data/speech/demo -save_model models -global_attention mlp -batch_size 8 -save_checkpoint 10000 -optim adam -max_grad_norm 100 -learning_rate 0.01 -learning_rate_decay 0.5 -decay_method rsqrt -train_steps 150000 -encoder_type brnn -decoder_type rnn -normalization tokens -bridge -window_size 0.025 -image_channel_size 3 -gpu_ranks 0 -world_size 1

Step 30000/150000; acc: 15.65; ppl: 268.82; xent: 5.59; lr: 0.00017;
Step 40000/150000; acc: 17.03; ppl: 253.23; xent: 5.53; lr: 0.00015;
Step 50000/150000; acc: 17.06; ppl: 229.16; xent: 5.43; lr: 0.00013;
Step 60000/150000; acc: 17.52; ppl: 225.95; xent: 5.42; lr: 0.00012

Training accuracy is improving very slowly.
What should be done to make training converge faster?
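
As a side note on reading the log lines above: assuming xent is the average per-token cross-entropy in nats, ppl should be roughly its exponential, so the two columns carry the same information. A small check using only the numbers posted in this thread:

```python
import math

# (step, ppl, xent) copied from the log lines in the post above
log_lines = [
    (30000, 268.82, 5.59),
    (40000, 253.23, 5.53),
    (50000, 229.16, 5.43),
    (60000, 225.95, 5.42),
]

for step, ppl, xent in log_lines:
    # xent is printed with two decimals, so exp(xent) only approximates ppl
    print(step, ppl, round(math.exp(xent), 2))
```

The recomputed values land within about one percent of the logged ppl, with the gap explained by the rounding of xent. Either column can therefore be used to track whether the loss is still decreasing between checkpoints.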