Speech-to-text data preprocessing

I am learning about speech preprocessing and have a few questions:
What is a token in speech-to-text?
What is the difference between window stride and window size?
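
To make the second question concrete, here is a minimal sketch of how I currently picture framing. It assumes that window size is the number of samples in each frame and that window stride is the number of samples between the starts of consecutive frames. The helper name `frame_signal`, the 16 kHz sample rate, and the 25 ms / 10 ms values are only illustrative assumptions on my part, not taken from any particular library.

```python
import numpy as np


def frame_signal(samples, window_size, window_stride):
    """Split a 1-D signal into (possibly overlapping) frames.

    Hypothetical helper: window_size is the frame length in samples,
    window_stride is the hop between consecutive frame starts.
    Trailing samples that do not fill a whole frame are dropped.
    """
    samples = np.asarray(samples)
    if len(samples) < window_size:
        return np.empty((0, window_size), dtype=samples.dtype)
    num_frames = 1 + (len(samples) - window_size) // window_stride
    starts = np.arange(num_frames) * window_stride
    return np.stack([samples[s:s + window_size] for s in starts])


sample_rate = 16000                        # assumed sample rate in Hz
signal = np.arange(sample_rate)            # stand-in for 1 second of audio
window_size = sample_rate * 25 // 1000     # 25 ms -> 400 samples
window_stride = sample_rate * 10 // 1000   # 10 ms -> 160 samples

frames = frame_signal(signal, window_size, window_stride)
print(frames.shape)
```

If this picture is right, consecutive frames overlap whenever the stride is smaller than the window size, and a smaller stride yields more frames for the same audio.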

I know that preprocessing starts from the raw audio samples. Beyond the sample level, which features of the sound waveform matter most?