Speech Rec 2

Speech Recognition
Feature Extraction
Speech recognition simplified block diagram
[Figure: block diagram with blocks labelled Training, Speech Capture, Feature Extraction, Models, Pattern Matching, Process Results, Text]
Speech capture
• Use a good-quality noise-cancelling mic
• Use a bandwidth of 4 kHz for phone
• Use a bandwidth of 8 kHz for desktop
• Sample at 8 kHz or 16 kHz
• Anti-alias filter the input
• Avoid background noise
• Speak clearly but naturally
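The bandwidths and sampling rates above are linked by the Nyquist criterion: the sampling rate must be at least twice the highest frequency to be captured. A minimal sketch of that arithmetic (the function name is only illustrative):

```python
def min_sample_rate_hz(bandwidth_hz):
    """Lowest sampling rate that can represent the given bandwidth (Nyquist)."""
    return 2 * bandwidth_hz

phone_rate = min_sample_rate_hz(4000)    # 4 kHz phone bandwidth
desktop_rate = min_sample_rate_hz(8000)  # 8 kHz desktop bandwidth
print(phone_rate, desktop_rate)
```

Anything above half the sampling rate has to be removed by the anti-alias filter before sampling, otherwise it folds back into the wanted band.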
Spectral Features
• Need to extract key frequency components
• Visible in a spectrogram – 2D real-time examples
Feature extraction
• Need to extract frequency content (spectrogram)
• Matching on raw data is inefficient
– Much of the data is redundant for information
– Analyse the signal and extract key features
– The same word spoken by different people looks very different in the time domain
– In the frequency domain, patterns are more evident
• Generally use Mel Frequency Cepstral Coefficients (MFCCs)
The process
• MFCCs are short-term spectral features
• They are calculated as follows
– Divide the signal into frames
– For each frame, obtain the amplitude spectrum
– Convert to the mel spectrum
– Take the natural logarithm
– Take the discrete cosine transform (DCT) to obtain the mel cepstrum
Divide signal into frames
Apply window function – typically a Hamming window
• Select about 25 ms of speech data and window it to cut it cleanly out of the data stream
• Shift the window by about 10 ms and do the same continuously
• Why Hamming? Why not rectangular?
• Now have a series of vectors being produced
– If sampling at 8 kHz, then sample period = 125 µs
– Vector size = 25 ms / 125 µs = 25000 / 125 = 200-element array
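The framing step can be sketched as follows, assuming 8 kHz sampling, 25 ms frames and a 10 ms shift as on the slide. The function names and the synthetic test tone are illustrative only. On the Hamming question: a tapered window reduces the spectral leakage that the abrupt edges of a rectangular cut would cause.

```python
import math

def hamming(n_points):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def frame_signal(samples, sample_rate_hz=8000, frame_ms=25, shift_ms=10):
    """Cut samples into overlapping, Hamming-windowed frames."""
    frame_len = sample_rate_hz * frame_ms // 1000   # 200 samples at 8 kHz
    shift = sample_rate_hz * shift_ms // 1000       # 80 samples at 8 kHz
    window = hamming(frame_len)
    frames = []
    # Only full-length frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, shift):
        chunk = samples[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# One second of a 440 Hz tone sampled at 8 kHz (synthetic input for illustration).
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
frames = frame_signal(tone)
print(len(frames), len(frames[0]))
```

Each entry of `frames` is one 200-element vector of the kind described above, ready to be passed to the spectral analysis step.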
• Feed the speech frame into an FFT to get the frequency components of that slice
• Calculate the power of the spectrum for each element of the vector
– s[k] = (Re X[k])^2 + (Im X[k])^2, where X[k] is the k-th FFT coefficient
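As a rough illustration of the power calculation, the sketch below uses a naive O(N^2) DFT from the standard library in place of a real FFT (an FFT returns the same values, only faster). It keeps bins 0 to N/2 only, since the upper half of the spectrum of a real signal mirrors the lower half. The names and the test tone are assumptions for the example.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of a real or complex frame."""
    n_points = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points)]

def power_spectrum(frame):
    """s[k] = Re(X[k])^2 + Im(X[k])^2 for bins 0 .. N/2."""
    spectrum = dft(frame)
    half = len(frame) // 2 + 1
    return [x.real ** 2 + x.imag ** 2 for x in spectrum[:half]]

# A 1 kHz tone sampled at 8 kHz, 200 samples long: the bin spacing is
# 8000 / 200 = 40 Hz, so the energy should peak at bin 1000 / 40 = 25.
frame = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(200)]
power = power_spectrum(frame)
peak_bin = max(range(len(power)), key=lambda k: power[k])
print(peak_bin)
```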
• Use a set of filters to split up the frequency bands
– Typically use mel-scale filters to match the basilar membrane. Get the energy in each band
– Sphinx III uses 40 filters over an 8 kHz bandwidth
• Frequency response is non-linear
– Mel(ody): m = 1127.01048 x log_e(1 + f/700)
– f = 700(e^{m / 1127.01048} – 1)
– Bark = 13 x arctan(0.76f / 1000) + 3.5 x arctan((f / 7500)^2)
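The scale conversions can be written as small helper functions. This sketch uses the standard forms of the formulas, with f in Hz: the inverse mel formula divides by the constant so that it undoes the forward one, and the Bark formula divides f by 1000 and by 7500. The function names are illustrative.

```python
import math

MEL_SCALE = 1127.01048

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz."""
    return MEL_SCALE * math.log(1 + f_hz / 700)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700 * (math.exp(mel / MEL_SCALE) - 1)

def hz_to_bark(f_hz):
    """Bark value for a frequency in Hz."""
    return 13 * math.atan(0.76 * f_hz / 1000) + 3.5 * math.atan((f_hz / 7500) ** 2)

for f in (100, 1000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2))
```

The constant is chosen so that 1000 Hz maps to approximately 1000 mel. Above that point the mel scale grows much more slowly than the frequency in Hz.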
• Calculate the mel spectrum by multiplying the power spectrum by each of the triangular mel weighting filters and integrating the result.
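A minimal sketch of the triangular filter bank, assuming filter edges equally spaced on the mel scale between 0 Hz and half the sampling rate. The 512-point FFT size and the unit-height triangles are assumptions for illustration and are not taken from the Sphinx III defaults.

```python
import math

MEL_SCALE = 1127.01048

def hz_to_mel(f_hz):
    return MEL_SCALE * math.log(1 + f_hz / 700)

def mel_to_hz(mel):
    return 700 * (math.exp(mel / MEL_SCALE) - 1)

def mel_filterbank(n_filters, n_fft, sample_rate_hz):
    """Return n_filters lists of weights, one weight per FFT bin 0 .. n_fft/2."""
    n_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(sample_rate_hz / 2)
    # n_filters + 2 edges: each triangle spans three consecutive edges.
    edges_hz = [mel_to_hz(top_mel * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = sample_rate_hz / n_fft
    bank = []
    for m in range(n_filters):
        left, centre, right = edges_hz[m:m + 3]
        weights = []
        for k in range(n_bins):
            f = k * bin_hz
            if left < f <= centre:
                weights.append((f - left) / (centre - left))    # rising edge
            elif centre < f < right:
                weights.append((right - f) / (right - centre))  # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mel_spectrum(power, bank):
    """Weight the power spectrum by each filter and sum (discrete integration)."""
    return [sum(w * p for w, p in zip(weights, power)) for weights in bank]

# 40 filters over an 8 kHz bandwidth (16 kHz sampling), applied to a flat spectrum.
bank = mel_filterbank(40, 512, 16000)
flat_power = [1.0] * 257
mel = mel_spectrum(flat_power, bank)
print(len(mel))
```

The result has one energy value per filter, so a 257-bin power spectrum is reduced to 40 numbers.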
• Calculate the mel cepstrum
– A DCT is applied to the natural logarithm of the mel spectrum to obtain the mel cepstrum. C is the number of cepstral coefficients required (n = 0 to 12 gives 13 for Sphinx III), L is the number of filter banks and S[i] is the mel spectrum coefficient, one for each filter output. C is usually less than L, as the DCT has the effect of compressing the spectrum such that the bulk of the information is in the first few coefficients. Sphinx III uses 40 filters but keeps only the first 13 cepstral coefficients.
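A sketch of the final step, assuming the unnormalised DCT-II form c[n] = sum over i of ln(S[i]) x cos(pi x n x (2i + 1) / (2L)) for n = 0 to C-1. The exact scaling used by Sphinx III is not given here, so treat the constants as an assumption. All mel spectrum values must be greater than zero for the logarithm to exist.

```python
import math

def mel_cepstrum(mel_spec, n_ceps=13):
    """Return the first n_ceps DCT coefficients of the log mel spectrum."""
    n_filters = len(mel_spec)                      # L
    log_spec = [math.log(s) for s in mel_spec]     # natural logarithm
    return [sum(log_spec[i] * math.cos(math.pi * n * (2 * i + 1) / (2 * n_filters))
                for i in range(n_filters))
            for n in range(n_ceps)]                # n = 0 .. C-1

# A flat mel spectrum of 40 filter outputs: all the information ends up in c[0].
flat_mel = [math.e] * 40
ceps = mel_cepstrum(flat_mel)
print(len(ceps), round(ceps[0], 6))
```

With a flat input every coefficient except c[0] is zero. This shows how the DCT packs a smooth spectrum into the lowest-order coefficients.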
Default values for the SPHINX III front-end
• Typical Feature Extraction Block Diagram
