Speech-to-text using Convolutional LSTM layers

I’m intrigued by the recent change to allow Kaldi MFCC input vectors.
Is there any reason why we can’t use OpenNMT to build a speech-to-text model? I’m wondering if we could combine it with a model like this deep conv network that achieved ~10% WER by using convolutional LSTM layers, but I haven’t explored the OpenNMT code deeply enough to know how significant a change (in terms of architecture) that would be to implement.
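
For concreteness, here is a minimal sketch (PyTorch, not OpenNMT code) of what I mean by a convolutional LSTM: the matrix multiplications inside the LSTM gates are replaced by 2-D convolutions over (frequency, time) feature maps. All class and variable names below are just for illustration, not anything from the paper or from OpenNMT.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose gate computations are 2-D convolutions."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keep the spatial size unchanged
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)

    def forward(self, x, state):
        # x: (batch, in_channels, freq, time); h, c have hidden_channels.
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, (h, c)

# Example step over a hypothetical (batch=4, 1 channel, 40 mel bins, 8 frames) patch.
cell = ConvLSTMCell(in_channels=1, hidden_channels=16)
x = torch.randn(4, 1, 40, 8)
state = (torch.zeros(4, 16, 40, 8), torch.zeros(4, 16, 40, 8))
h, state = cell(x, state)
```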
Does anyone have any feedback on this idea?

Yes, people are working on convolutional encoders. You can also take a look at Im2Text.

We already have a pyramidal encoder, as described in Listen, Attend and Spell (Chan et al. 2016). However, applying plain LSTM layers directly to speech frames is known to require a lot of memory, since the frame sequences are very long.
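
For anyone unfamiliar with it, the pyramidal trick concatenates pairs of consecutive frames between layers, halving the sequence length at each level, which is what keeps memory usage manageable. A minimal sketch of one such layer, assuming a PyTorch-style API (the names are illustrative, not OpenNMT’s actual implementation):

```python
import torch
import torch.nn as nn

class PyramidalLSTMLayer(nn.Module):
    """One level of a pyramidal encoder: halve time, then run an LSTM."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Concatenating two frames doubles the feature dimension.
        self.lstm = nn.LSTM(2 * input_size, hidden_size, batch_first=True)

    def forward(self, x):
        # x: (batch, time, feat); drop a trailing frame if time is odd.
        b, t, f = x.size()
        x = x[:, : t - (t % 2), :].reshape(b, t // 2, 2 * f)
        out, _ = self.lstm(x)
        return out

# Each stacked layer halves the number of time steps the next LSTM sees.
frames = torch.randn(4, 100, 40)          # e.g. 100 MFCC frames of dim 40
layer = PyramidalLSTMLayer(40, 256)
print(layer(frames).shape)                 # torch.Size([4, 50, 256])
```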