OpenNMT Forum

Speech to text: using MFCC at the preprocessing level

How can I implement Mel-frequency cepstral coefficients (MFCC) in OpenNMT-py's preprocess.py?
Is it possible in OpenNMT-py?

You can have a look around here.

I looked into it, but I am not proficient in Python. Can you tell me whether OpenNMT uses the MFCC algorithm at the preprocessing level, or does it only use the spectrogram as the final token?

I tried this step, but got stuck at this error:
RuntimeError: input.size(-1) must be equal to input_size. Expected 201, got 12.
The spectrogram array shape is (201, x), but the MFCC array shape is (12, x).
How can I resolve this issue?

I suspect this is because the features you add don’t match the dimensions that the model is expecting.
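To make the mismatch concrete, here is a minimal sketch of where the two numbers in the error could come from. It assumes 16 kHz audio (the sample rate is not stated in this thread) and uses the 0.025 s window from the -window_size flag in the training command posted elsewhere in this thread. The variable names are illustrative and not taken from OpenNMT-py.

```python
# Assumption: 16 kHz audio; the thread does not state the sample rate.
sample_rate = 16000
# Window length in seconds, as in the -window_size 0.025 flag.
window_size = 0.025

# Number of samples per analysis window (used as the FFT size).
n_fft = round(sample_rate * window_size)

# A one-sided FFT of n_fft samples yields n_fft // 2 + 1 frequency bins.
spectrogram_bins = n_fft // 2 + 1

# The MFCC features in the question have 12 coefficients per frame.
n_mfcc = 12

print("spectrogram feature size:", spectrogram_bins)
print("MFCC feature size:", n_mfcc)
```

Under these assumptions the spectrogram has 201 values per frame, which matches the "Expected 201" in the error, while each MFCC frame has only 12. The encoder input size therefore has to agree with whichever feature type is fed in.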

Can I change this?

Can I change the input dimension that the model expects?

You surely can, but you’ll probably have to keep diving in the code and see for yourself. It will also probably depend on the model you’ll want to use.
The audio codepath was introduced by @da03 some time ago now and is not extensively worked on at the moment, and only compatible with RNN models IIRC.

By the way, when reporting errors, please post the full trace and the command line(s) that triggered it.

I got this error:
RuntimeError: input.size(-1) must be equal to input_size. Expected 201, got 12.
I changed the input size, and after that it ran fine.
Can you tell me which is better, MFCC or spectrogram?

I think the consensus is leaning towards MFCC, but again, it may depend on your task, model, etc. You may search for some literature and do experiments accordingly.
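For readers comparing the two feature types, the following is a rough NumPy sketch of how MFCCs can be derived from a power spectrogram: apply a mel filterbank, take the log, then apply a DCT and keep the first few coefficients. It is not the OpenNMT-py implementation. The 26 mel filters, the 16 kHz sample rate, the unnormalized DCT, and the random input are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fbank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fbank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along axis 0, keeping the first n_out coefficients."""
    n = x.shape[0]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))  # shape (n_out, n)
    return basis @ x

def mfcc_from_power_spectrogram(power_spec, sample_rate, n_mels=26, n_mfcc=12):
    """power_spec has shape (freq_bins, frames); returns (n_mfcc, frames)."""
    n_fft = 2 * (power_spec.shape[0] - 1)
    fbank = mel_filterbank(n_mels, n_fft, sample_rate)
    mel_energy = fbank @ power_spec          # (n_mels, frames)
    log_mel = np.log(mel_energy + 1e-10)     # small offset avoids log(0)
    return dct_ii(log_mel, n_mfcc)

# Stand-in data: a random (201, 50) "spectrogram" instead of real audio.
rng = np.random.default_rng(0)
power_spec = rng.random((201, 50))
mfcc = mfcc_from_power_spectrogram(power_spec, sample_rate=16000)
print(power_spec.shape, "->", mfcc.shape)
```

The point of the sketch is the change in shape: each frame shrinks from 201 spectrogram bins to 12 coefficients, so MFCC is a much more compact (and lossier) representation. Which one works better remains an empirical question for a given task and model.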

I tried MFCC on OpenNMT-py.
I made some changes to the features in the encoder script audio_dataset.py.
Training is running and I am waiting for results.

Hi Franc,
I am working on the optimizer.
I am using SGD as the optimizer.
Do you have any idea how many training steps it will take?

  • It will depend on your task / data / #parameters / configuration / hardware;
  • I did not extensively experiment with speech on OpenNMT-py so I wouldn’t be able to share any numbers.

But, based on the other post where you mentioned having 40GB of data, I’d say you may need at least several tens of thousands of steps, but don’t take my word for it.
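As a rough way to reason about step counts, the arithmetic relating steps, batch size, and passes over the data can be sketched as follows. The number of utterances (100,000) is purely hypothetical, since the thread only mentions the dataset size in gigabytes, and the batch size of 8 is borrowed from the training command posted elsewhere in this thread, assuming it counts utterances.

```python
import math

def steps_per_epoch(n_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(n_examples / batch_size)

# Hypothetical dataset size; not a figure from the thread.
n_utterances = 100_000
# Assumed to be a batch size measured in utterances.
batch_size = 8

per_epoch = steps_per_epoch(n_utterances, batch_size)
for train_steps in (10_000, 50_000, 150_000):
    epochs = train_steps / per_epoch
    print(f"{train_steps} steps ~ {epochs:.1f} passes over the data")
```

This only converts steps into passes over the data. How many passes are needed still depends on the task, model, and configuration, as noted above.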

If you get interesting experiments, feel free to share some config / results / description of your task so that it can help others!

I need some help.
Can you tell me where I can find the loss function in OpenNMT-py?

Hi Franc,
I have done the preprocessing using MFCC, but model training gets stuck in the early steps.

python3 train.py -model_type audio -enc_rnn_size 2048 -dec_rnn_size 1024 -audio_enc_pooling 1,1,1,1,2,2,2,2 -dropout 0.1 -enc_layers 8 -dec_layers 6 -rnn_type LSTM -data data/speech/demo -save_model models -global_attention mlp -batch_size 8 -save_checkpoint 10000 -optim adam -max_grad_norm 100 -learning_rate 0.01 -learning_rate_decay 0.5 -decay_method rsqrt -train_steps 150000 -encoder_type brnn -decoder_type rnn -normalization tokens -bridge -window_size 0.025 -image_channel_size 3 -gpu_ranks 0 -world_size 1

Step 30000/150000; acc: 15.65; ppl: 268.82; xent: 5.59; lr: 0.00017;
Step 40000/150000; acc: 17.03; ppl: 253.23; xent: 5.53; lr: 0.00015;
Step 50000/150000; acc: 17.06; ppl: 229.16; xent: 5.43; lr: 0.00013;
Step 60000/150000; acc: 17.52; ppl: 225.95; xent: 5.42; lr: 0.00012

Training accuracy is improving very slowly.
What should be done to make training converge faster?
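Before changing anything, it can help to read the logged numbers more closely. The sketch below copies the four log lines above and checks two relationships: that ppl is approximately exp(xent), and that lr multiplied by sqrt(step) stays roughly constant, which is what an inverse-square-root decay would produce. The interpretation is an assumption based on the -decay_method rsqrt flag; the constant it computes is inferred from the log and is not a documented parameter.

```python
import math

# (step, xent, ppl, lr) copied from the log lines above.
logged = [
    (30000, 5.59, 268.82, 0.00017),
    (40000, 5.53, 253.23, 0.00015),
    (50000, 5.43, 229.16, 0.00013),
    (60000, 5.42, 225.95, 0.00012),
]

for step, xent, ppl, lr in logged:
    # Perplexity implied by the (rounded) per-token cross-entropy.
    ppl_from_xent = math.exp(xent)
    # Stays roughly constant if lr decays like 1 / sqrt(step).
    lr_times_sqrt = lr * math.sqrt(step)
    print(f"step {step}: exp(xent)={ppl_from_xent:.2f} (logged ppl {ppl}), "
          f"lr*sqrt(step)={lr_times_sqrt:.4f}")
```

If the pattern holds, the learning rate is already on the order of 1e-4 by step 30000 and keeps shrinking, so later steps move the parameters less and less. Whether that explains the slow improvement here is not established in the thread; it is one thing to rule out when experimenting with the learning rate and decay settings.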