Transcript Chapter 2
Speech Recognition. Chapter 3: Speech Front-Ends.

Contents: Linear Prediction Analysis; Linear-Prediction-Based Processing; Cepstral Analysis; Auditory Signal Processing.

Linear Prediction Analysis covers: introduction; the linear prediction model; linear prediction coefficients computation; linear prediction for automatic speech recognition; linear prediction in speech processing; how good is the LP model.

Signal Processing Front End

The front end converts the speech waveform $s_k$ into a parametric representation $O = o(1)\,o(2)\ldots o(T)$. Two common front ends are the filter bank and linear prediction.

Linear Prediction: Introduction

- Over short intervals, linear prediction provides a good model of the speech signal.
- It is mathematically precise and simple, and easy to implement in software or hardware.
- It works well for recognition applications.
- It also has applications in formant and pitch estimation, speech coding and synthesis.

Linear Prediction Model

Basic idea: each sample is predicted from a linear combination of the preceding p samples,
$$s_n \approx a_1 s_{n-1} + a_2 s_{n-2} + \dots + a_p s_{n-p},$$
where $a_1, a_2, \ldots, a_p$ are called the LP (Linear Prediction) coefficients. By including the excitation signal we obtain
$$s_n = \sum_{i=1}^{p} a_i s_{n-i} + G u_n,$$
where $u_n$ is the normalised excitation and $G$ is the gain of the excitation. In the z-domain (Sec. 1.1.4, p. 15, Deller),
$$S(z) = \sum_{i=1}^{p} a_i z^{-i} S(z) + G\,U(z),$$
leading to the transfer function (Fig. 3.27)
$$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}}.$$
The LP model retains the spectral magnitude, but it is minimum phase (Sec. 1.1.7, Deller). In practice, however, phase is not very important for speech perception. Observation: $H(z)$ also models the glottal filter $G(z)$ and the lip radiation $R(z)$.

Linear Prediction Coefficients Computation

Minimising the squared prediction error leads to the normal equations (Sec. 3.3.2):
$$\sum_{i=1}^{M} a_i \sum_{m} s_{m-k}\,s_{m-i} = \sum_{m} s_m\,s_{m-k}, \qquad k = 1, \ldots, M.$$

Methodologies:
- Autocorrelation method.
- Covariance method (not commonly used in speech recognition).

Autocorrelation Method

Assumption: each frame is independent of the others (Fig. 3.29).
Under this assumption the solution (Juang, Sec. 3.3.3, pp. 105-106) becomes
$$\sum_{i=1}^{M} a_i\, r_{|k-i|} = r_k, \qquad k = 1, \ldots, M, \qquad (2)$$
with the autocorrelations
$$r_i = \sum_{q=0}^{N-1-i} s_q\, s_{q+i}, \qquad i = 0, 1, \ldots, M.$$
These equations are known as the Yule-Walker equations. Using matrix notation, $R\,\mathbf{a} = \mathbf{r}$, where
$$R = \begin{pmatrix}
r_0 & r_1 & r_2 & \cdots & r_{M-1} \\
r_1 & r_0 & r_1 & \cdots & r_{M-2} \\
r_2 & r_1 & r_0 & \cdots & r_{M-3} \\
\vdots & & & \ddots & \vdots \\
r_{M-1} & r_{M-2} & r_{M-3} & \cdots & r_0
\end{pmatrix}.$$
$R$ is symmetric, and all elements on each diagonal are equal (the main diagonal is $r_0$ everywhere). A matrix with this structure is known as a Toeplitz matrix, and a linear system with a Toeplitz matrix can be solved very efficiently. Examples: Figs. 3.32, 3.33, 3.34, 3.35 and 3.36.

Linear Prediction for Automatic Speech Recognition

The LP front end chains the following steps: preemphasis (flattens the spectrum), framing and windowing (minimise signal discontinuities), autocorrelation analysis (equation (2), usually M = 8), the Durbin algorithm, conversion to cepstral coefficients (minimises noise sensitivity), and temporal derivatives (incorporate signal dynamics).

Preemphasis

The transfer function of the glottis can be modelled with two poles,
$$U_g(z) = \frac{1}{(1 - \alpha_1 z^{-1})(1 - \alpha_2 z^{-1})}, \qquad 0.9 \le \alpha_1, \alpha_2 \le 1.0,$$
and the radiation effect with one zero,
$$R(z) = 1 - \beta z^{-1}, \qquad 0.9 \le \beta \le 1.0,$$
which approximately cancels one of the glottal poles. Hence, to obtain the transfer function of the vocal tract alone, the remaining pole must be cancelled with a preemphasis filter
$$H_{pre}(z) = 1 - \mu z^{-1}, \qquad 0.9 \le \mu \le 1.0.$$
Preemphasis should be applied only to sonorant sounds. The process can be automated by setting
$$\mu = \frac{r_s(1)}{r_s(0)},$$
where $r_s(i)$ is the autocorrelation function: $\mu \approx 1$ for sonorant sounds and $\mu \approx 0$ for non-sonorant sounds.

Framing and Windowing

The signal is split into frames of N samples with a frame shift of M samples. Windowing minimises the signal discontinuities at the edges of each frame. A typical window is the Hamming window,
$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N - 1}\right), \qquad 0 \le n \le N - 1.$$

LPC Analysis

Converts the autocorrelation coefficients into an LPC "parameter set":
- LPC coefficients,
- reflection (PARCOR) coefficients,
- log area ratio coefficients.
The formal method to obtain the LPC parameter set is known as Durbin's method.
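As an illustration of the autocorrelation method, here is a minimal NumPy sketch (the function names `autocorrelation` and `yule_walker_lpc` are my own) that builds the symmetric Toeplitz matrix R from a windowed frame and solves R a = r directly; for larger M one would exploit the Toeplitz structure (Durbin's method) instead.

```python
import numpy as np

def autocorrelation(frame, M):
    """Autocorrelations r_0 .. r_M of a (windowed) frame:
    r_i = sum_q s_q * s_{q+i}."""
    N = len(frame)
    return np.array([np.dot(frame[:N - i], frame[i:]) for i in range(M + 1)])

def yule_walker_lpc(frame, M):
    """Autocorrelation method: solve the Yule-Walker system R a = r,
    where R is the symmetric Toeplitz matrix built from r_0 .. r_{M-1}."""
    r = autocorrelation(frame, M)
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    a = np.linalg.solve(R, r[1:])
    return a, r
```

On a frame drawn from an AR(M) process, the recovered `a` approximates the generating coefficients.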
Durbin's Method

$$E_0 = r_0$$
For $p = 1, \ldots, p_{\max}$:
$$k_p = \frac{r_p - \sum_{i=1}^{p-1} a_i^{(p-1)} r_{p-i}}{E_{p-1}},$$
$$a_p^{(p)} = k_p,$$
$$a_i^{(p)} = a_i^{(p-1)} - k_p\, a_{p-i}^{(p-1)}, \qquad i = 1, \ldots, p-1,$$
$$E_p = (1 - k_p^2)\, E_{p-1}.$$
The first iterations work out as: $E_0 = r_0$; $k_1 = r_1 / E_0$, $a_1^{(1)} = k_1$, $E_1 = (1 - k_1^2) E_0$; $k_2 = (r_2 - a_1^{(1)} r_1) / E_1$, $a_2^{(2)} = k_2$, $a_1^{(2)} = a_1^{(1)} - k_2 a_1^{(1)}$, $E_2 = (1 - k_2^2) E_1$; $k_3 = (r_3 - \sum_{i=1}^{2} a_i^{(2)} r_{3-i}) / E_2$, and so on.

LPC Analysis (Typical Values)

Parameter          Fs = 6.67 kHz    Fs = 8 kHz      Fs = 10 kHz
N (frame size)     300 (45 ms)      240 (30 ms)     300 (30 ms)
M (frame shift)    100 (15 ms)      80 (10 ms)      100 (10 ms)
p                  8                10              10
Q                  12               12              12
K                  3                3               3

LPC Parameter Conversion

Conversion to cepstral coefficients gives a robust feature set for speech recognition. Algorithm:
$$c_0 = \ln G^2,$$
$$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad 1 \le m \le p,$$
$$c_m = \sum_{k=m-p}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \qquad m > p.$$

Parameter Weighting

Low-order cepstral coefficients are highly sensitive to noise, so the cepstral coefficients are usually weighted.

Temporal Cepstral Derivatives

First- or second-order derivatives are enough to capture the signal dynamics. The derivative can be approximated as
$$\Delta c_m(t) = \sum_{k=-K}^{K} k\, c_m(t+k).$$

Prediction Error

Given the predictor $\tilde{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$ and the model $s_n = \sum_{i=1}^{p} a_i s_{n-i} + G u_n$, the prediction error is
$$e_n = s_n - \tilde{s}_n = s_n - \sum_{i=1}^{p} a_i s_{n-i},$$
and the total squared error is
$$E = \sum_n (s_n - \tilde{s}_n)^2 = \sum_n e_n^2.$$
Observations for Hamming-windowed analysis:
- Large prediction errors occur at the start of a frame, since speech there is predicted from previous samples arbitrarily set to zero.
- Unvoiced signals are not position sensitive and show no special effect at the frame edges.
- Observe the "whitening" phenomenon in the error spectrum.
- Observe the periodicity of the error waveform; it is taken as the basis for pitch estimators.
- Observe that a sharp decrease in the prediction error is already obtained for small model orders (M = 1...4).
- Observe that unvoiced signals have a higher RMS error.
- Observe the all-pole model's ability to match the spectrum.
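Durbin's recursion and the LP-to-cepstrum conversion above can be sketched as follows (a minimal illustration; the function names are mine, and `r` holds the autocorrelations r_0 .. r_p):

```python
import numpy as np

def levinson_durbin(r, p):
    """Durbin's recursion: returns LP coefficients a_1..a_p, reflection
    (PARCOR) coefficients k_1..k_p, and the final prediction error E_p."""
    a = np.zeros(p + 1)          # a[i] holds a_i of the current order
    k = np.zeros(p + 1)
    E = r[0]
    for i in range(1, p + 1):
        k[i] = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E
        a_new = a.copy()
        a_new[i] = k[i]
        for j in range(1, i):
            a_new[j] = a[j] - k[i] * a[i - j]
        a = a_new
        E *= (1.0 - k[i] ** 2)
    return a[1:], k[1:], E

def lpc_to_cepstrum(a, G, Q):
    """LP-to-cepstrum recursion: c_m = a_m + sum_{k<m} (k/m) c_k a_{m-k},
    with terms where m-k > p dropped; returns c_1 .. c_Q."""
    p = len(a)
    c = np.zeros(Q + 1)
    c[0] = np.log(G ** 2)
    for m in range(1, Q + 1):
        c[m] = a[m - 1] if m <= p else 0.0
        for kk in range(1, m):
            if m - kk <= p:
                c[m] += (kk / m) * c[kk] * a[m - kk - 1]
    return c[1:]
```

For any valid autocorrelation sequence the result coincides with solving the Yule-Walker system directly, at O(p^2) cost instead of O(p^3).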
Linear Prediction in Speech Processing

Applications: LPC for vocal tract shape estimation; LPC for pitch detection; LPC for formant estimation.

LPC for Vocal Tract Shape Estimation

The analysis must be free of glottis and radiation effects, so preemphasis is applied, and windowing minimises signal discontinuities. The parameters of interest are the reflection coefficients $k_n$, $0 \le n \le N$.

Parameter Calculation

1. Durbin's method (as in speech recognition). If this method is used, the autocorrelation analysis must be performed first.
2. Lattice filters. The reflection coefficients are obtained directly from the signal, avoiding the autocorrelation analysis. Methods: Itakura-Saito (PARCOR), Burg, and newer forms. Advantage: easier to implement in hardware. Disadvantage: needs around 5 times more computation.

Itakura-Saito (PARCOR)

$$k_p = \frac{\sum_n e_n^{(p-1)}\, f_{n-1}^{(p-1)}}
{\sqrt{\sum_n \big(e_n^{(p-1)}\big)^2}\ \sqrt{\sum_n \big(f_{n-1}^{(p-1)}\big)^2}},$$
where the forward and backward prediction errors follow the lattice recursions
$$e_n^{(p)} = e_n^{(p-1)} - k_p\, f_{n-1}^{(p-1)},$$
$$f_n^{(p)} = f_{n-1}^{(p-1)} - k_p\, e_n^{(p-1)}.$$
It can be shown that the PARCOR coefficients obtained by the Itakura-Saito method are exactly the same as the reflection coefficients obtained by the Levinson-Durbin algorithm. The sums accumulate over time n.

Burg

$$k_p = \frac{2 \sum_n e_n^{(p-1)}\, f_{n-1}^{(p-1)}}
{\sum_n \big(e_n^{(p-1)}\big)^2 + \sum_n \big(f_{n-1}^{(p-1)}\big)^2},$$
with the same lattice recursions for $e_n^{(p)}$ and $f_n^{(p)}$ as above.

Lattice Structure

The order-zero errors are initialised as $e_n^{(0)} = f_n^{(0)} = s_n$; each lattice stage p applies the coefficient $k_p$ and a unit delay $z^{-1}$ on the backward path (see the Itakura-Saito and Burg examples in the figures).

New Forms

P. Strobach, "New forms of Levinson and Schur algorithms", IEEE Signal Processing Magazine, pp. 12-36, 1991.

Vocal Tract Shape Estimation

From
$$k_n = \frac{A_{n+1} - A_n}{A_{n+1} + A_n}, \qquad 0 \le n \le N,$$
we obtain
$$A_{n+1} = \frac{1 + k_n}{1 - k_n}\, A_n, \qquad 0 \le n \le N.$$
Therefore, by setting the lip area to an arbitrary value, we can obtain the vocal tract configuration relative to that initial condition. This technique has been successfully used to train deaf persons.
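The relative-area computation above is one multiplication per tube section; a minimal sketch (the function name and the unit reference area are my own choices):

```python
def vocal_tract_areas(k, reference_area=1.0):
    """Relative cross-sectional areas A_0, A_1, ..., A_N from PARCOR
    coefficients, via A_{n+1} = A_n * (1 + k_n) / (1 - k_n).
    The first section's area is fixed to an arbitrary reference value."""
    areas = [reference_area]
    for kn in k:
        areas.append(areas[-1] * (1.0 + kn) / (1.0 - kn))
    return areas
```

Note that only the area ratios are meaningful: scaling the reference scales every section by the same factor.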
LPC for Pitch Detection

Processing chain (speech sampled at 10 kHz): low-pass filter at 800 Hz, downsampler 5:1, LPC analysis, inverse filtering with A(z), autocorrelation, peak finding, voiced/unvoiced decision and pitch.

LPC for Formant Detection

Processing chain: sampled speech, LPC analysis, LPC spectrum, peak emphasis (second derivative), peak finding, formants.

LPC Spectrum

LP assumes that the vocal tract system can be modelled with an all-pole system $1/A(z)$, where
$$A(z) = \sum_{i=0}^{P} a_i z^{-i} \qquad (a_0 = 1).$$
The spectrum is obtained by evaluating on the unit circle, $z = e^{j\omega}$:
$$\frac{1}{|A(e^{j\omega})|}.$$
In order to emphasise the formant peaks we can instead set $z = \gamma e^{j\omega}$ with $0.9 \le \gamma \le 1.0$, evaluating slightly inside the unit circle, closer to the poles.

Spectrum (DTFT): $A(e^{j\omega}) = \sum_{i=0}^{P} a_i e^{-j\omega i}$.

Spectrum (DFT): $A(e^{j 2\pi k / N}) = \sum_{i=0}^{P} a_i e^{-j 2\pi k i / N}$.

In order to increase the spectral resolution we pad with zeros:
$$A(e^{j 2\pi k / N}) = \sum_{i=0}^{N-1} a_i e^{-j 2\pi k i / N}, \qquad N \gg P, \quad a_i = 0 \ \text{for}\ i > P,$$
with $N = 2^n$ so that an FFT algorithm can be used. The procedure is:
1. Calculate the spectral magnitude (DFT): $|A(e^{j 2\pi k / N})|$, k = 0, ..., N-1.
2. Invert the spectral magnitude: $1 / |A(e^{j 2\pi k / N})|$, k = 0, ..., N-1.
The result is called the LPC spectrum.

How Good is the LP Model?

As shown by the physiological analysis of the vocal tract, the speech model has both poles and zeros:
$$\Theta(z) = G\, \frac{1 + \sum_{i=1}^{L} b_i z^{-i}}{1 + \sum_{i=1}^{R} a_i z^{-i}}.$$
However, it can be shown that the LP model is good for estimating the magnitude of a pole-zero system. According to Lemma 1 and Lemma 2 below, $\Theta(z)$ can be written as
$$\Theta(z) = G\, \frac{1}{1 + \sum_{i=1}^{I} \tilde{a}_i z^{-i}}\ \Theta_{ap}(z),$$
where $\Theta_{ap}(z)$ is an all-pass component. The LP estimates $\hat{a}_i$ are calculated so that they correspond to the $\tilde{a}_i$ of this model. Since $|\Theta_{ap}(e^{j\omega})|$ is constant, the magnitude satisfies $|\Theta(e^{j\omega})| \propto |G\,\Theta_{\min}(e^{j\omega})|$; therefore, if the estimates $\hat{a}_i$ are exact, we obtain at least a model with the correct magnitude.
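The zero-padded-FFT procedure above can be sketched as follows (a minimal illustration using the predictor convention A(z) = 1 − Σ a_i z^-i; the `gamma` parameter moves the evaluation circle inside the unit circle to sharpen formant peaks, as described above):

```python
import numpy as np

def lpc_spectrum(a, G=1.0, nfft=512, gamma=1.0):
    """LPC spectrum G / |A(e^{j 2 pi k / nfft})| via a zero-padded FFT.
    Evaluating at z = gamma * e^{j w} (gamma < 1) scales coefficient i
    by gamma^{-i}, moving the contour closer to the poles."""
    a = np.asarray(a, dtype=float)
    coeffs = np.zeros(nfft)
    coeffs[0] = 1.0
    coeffs[1:len(a) + 1] = -a / gamma ** np.arange(1, len(a) + 1)
    return G / np.abs(np.fft.fft(coeffs))
```

For a single real pole at z = 0.9 the peak sits at DC with height 1/(1 − 0.9) = 10, and shrinking gamma raises it.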
Lemma 1 (System Decomposition)

Any causal rational system
$$\Theta(z) = G\, \frac{1 + \sum_{i=1}^{L} b_i z^{-i}}{1 + \sum_{i=1}^{R} a_i z^{-i}}$$
can be decomposed as
$$\Theta(z) = G\, \Theta_{\min}(z)\, \Theta_{ap}(z),$$
where $\Theta_{\min}(z)$ is a minimum-phase component and $\Theta_{ap}(z)$ is all-pass.

Proof (for two poles and two zeros). Let the zeros lie outside the unit circle at $1/\beta$ and $1/\beta^*$ with $|\beta| < 1$, and the poles inside at $p$, $p^*$:
$$\Theta(z) = \frac{(z - 1/\beta)(z - 1/\beta^*)}{(z - p)(z - p^*)}.$$
Multiplying and dividing by $(z - \beta)(z - \beta^*)$,
$$\Theta(z) = \underbrace{\frac{(z - \beta)(z - \beta^*)}{(z - p)(z - p^*)}}_{\Theta_{\min}(z)} \cdot \underbrace{\frac{(z - 1/\beta)(z - 1/\beta^*)}{(z - \beta)(z - \beta^*)}}_{\Theta_x(z)}.$$
The first factor has all its poles and zeros inside the unit circle, so it is minimum phase. For the second factor, using $|H(e^{j\omega})|^2 = H(z)\,H(z^{-1})\big|_{z = e^{j\omega}}$, a direct computation gives
$$|\Theta_x(e^{j\omega})|^2 = \frac{1}{|\beta|^4},$$
a constant; hence $\Theta_x(z)$ is, up to the constant gain $1/|\beta|^2$, the all-pass component $\Theta_{ap}(z)$. End of proof.

Lemma 2

The minimum-phase component can be expressed as an all-pole system:
$$\Theta_{\min}(z) = \frac{1}{1 + \sum_{i=1}^{I} \tilde{a}_i z^{-i}};$$
in theory I goes to infinity, in practice it is limited.

Linear-Prediction-Based Processing

Topics: critique of the linear prediction model; Perceptual Linear Prediction (PLP); LP cepstra.

Critique of the Linear Prediction Model

The LP spectrum approximates the speech spectrum equally well at all frequencies of the analysis band. This property is inconsistent with human hearing.

Perceptual Linear Prediction (PLP)

Processing chain: critical-band spectral analysis, equal-loudness preemphasis, intensity-loudness power law, IDFT, solution of the Yule-Walker equations, yielding the coefficients $a_i$.

Critical-Band Spectral Analysis

The speech signal is framed and windowed (20 ms Hamming window) and the short-term spectrum $P(\omega)$ is computed with the DFT (200 samples plus 56 zeros of padding for Fs = 10 kHz). The frequency axis is then warped from Hertz to Barks:
$$\Omega(\omega) = 6 \ln\!\left\{\frac{\omega}{1200\pi} + \left[\left(\frac{\omega}{1200\pi}\right)^2 + 1\right]^{0.5}\right\}.$$
The warped spectrum is convolved with a critical-band masking curve $\Psi(\Omega)$ (a filter-bank approximation) and downsampled:
$$\Theta(\Omega_i) = \sum_{\Lambda = -1.3}^{2.5} P(\Omega_i + \Lambda)\, \Psi(\Lambda), \qquad i = 1, \ldots, 18.$$
The masking curve is approximated by
$$\Psi(\Omega) = \begin{cases}
0, & \Omega < -1.3, \\
10^{2.5 (\Omega + 0.5)}, & -1.3 \le \Omega \le -0.5, \\
1, & -0.5 < \Omega < 0.5, \\
10^{-1.0 (\Omega - 0.5)}, & 0.5 \le \Omega \le 2.5, \\
0, & \Omega > 2.5.
\end{cases}$$
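The Bark warping and the critical-band masking curve above translate directly into code (a sketch with my own function names; in Hertz the warping reduces to 6·asinh(f/600), since ω = 2πf makes ω/1200π = f/600):

```python
import math

def hz_to_bark(f):
    """PLP frequency warping Omega(omega) = 6 ln{omega/1200pi +
    [(omega/1200pi)^2 + 1]^0.5}, written in Hertz as 6*asinh(f/600)."""
    return 6.0 * math.asinh(f / 600.0)

def critical_band_curve(omega_bark):
    """Piecewise approximation of the masking curve Psi(Omega)."""
    if omega_bark < -1.3 or omega_bark > 2.5:
        return 0.0
    if omega_bark <= -0.5:
        return 10.0 ** (2.5 * (omega_bark + 0.5))
    if omega_bark < 0.5:
        return 1.0
    return 10.0 ** (-1.0 * (omega_bark - 0.5))
```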
Equal-Loudness Preemphasis

Approximates the unequal sensitivity of human hearing at different frequencies:
$$E(\omega) = \frac{(\omega^2 + 56.8 \times 10^6)\, \omega^4}{(\omega^2 + 6.3 \times 10^6)^2\, (\omega^2 + 0.38 \times 10^9)}$$
(an additional factor $(\omega^6 + 9.58 \times 10^{26})$ in the denominator can be used to model the steep drop in sensitivity at the highest frequencies).

Intensity-Loudness Power Law

Approximates the non-linear relation between the intensity of a sound and its perceived loudness:
$$\Phi(\Omega) = \Theta(\Omega)^{1/3}.$$

Cepstral Analysis

Topics: introduction; homomorphic processing; cepstral spectrum; cepstrum; mel-cepstrum; cepstrum in speech processing.

Introduction

When speech is preemphasised, $S(\omega) = E(\omega) H(\omega)$, where $E(\omega)$ is the excitation spectrum and $H(\omega)$ the vocal tract response. The excitation is not necessary for estimating the vocal tract function, so it is desirable to separate the excitation information from the vocal tract information. If we regard the speech spectrum itself as a signal, we can observe that it is the product of a slowly varying signal, $H(\omega)$, and a rapidly varying signal, $E(\omega)$. The formal technique that exploits this structure is called homomorphic processing.

Homomorphic Processing

Homomorphic processing is a technique for filtering non-linearly combined signals: the non-linearly related signals are transformed into a domain where they combine linearly, filtered there with an ordinary linear filter $F(z)$, and transformed back ($H[\cdot]$, then $F(z)$, then $H^{-1}[\cdot]$). For the speech spectrum, a complex logarithm turns the product into a sum,
$$\log S(\omega) = \log E(\omega) + \log H(\omega),$$
with $\exp[\cdot]$ as the inverse transformation.

Cepstral Spectrum

Definition:
$$\hat{s}(n) = \frac{T}{2\pi} \int_0^{2\pi/T} \log S(e^{j\omega T})\, e^{j\omega T n}\, d\omega,$$
where $S(e^{j\omega T})$ is the STFT and the complex logarithm is
$$\log S(e^{j\omega T}) = \log |S(e^{j\omega T})| + j \angle S(e^{j\omega T}).$$

Cepstrum

Definition:
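In discrete form, the cepstral definition reduces to an inverse DFT of the log magnitude spectrum; a minimal sketch (the name is mine, and the small floor is only a guard against the log of an exactly-zero bin):

```python
import numpy as np

def real_cepstrum(frame, nfft=None):
    """Real cepstrum of a frame: IDFT of log |DFT(frame)|."""
    if nfft is None:
        nfft = len(frame)
    spectrum = np.fft.fft(frame, nfft)
    return np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)).real
```

For the minimum-phase sequence [1, a] with |a| < 1, the cepstrum satisfies c_0 ≈ 0 and c_1 ≈ a/2, which makes a convenient sanity check.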
$$c_s(n) = \mathrm{Re}\{\hat{s}(n)\} = \frac{T}{2\pi} \int_0^{2\pi/T} \log |S(e^{j\omega T})|\, e^{j\omega T n}\, d\omega.$$

Cepstrum in Speech Processing

Applications: pitch estimation; formant estimation; combined pitch and formant estimation.

Pitch Estimation
Sampled speech, cepstrum, high-pass liftering, peak emphasis (second derivative), peak finding, pitch.

Formant Estimation
Sampled speech, cepstrum, low-pass liftering, peak emphasis (second derivative), peak finding, formants.

Pitch and Formant Estimation
Both chains run from the same cepstrum: the high-pass liftering branch yields the pitch, and the low-pass liftering branch yields the formants.
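A minimal sketch of the pitch-estimation chain above (the function name, search range, and FFT sizing are my own choices; a real system would add liftering, peak emphasis, and a voiced/unvoiced decision as described):

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=400.0):
    """Estimate pitch in Hz as the quefrency of the largest cepstral
    peak in the range corresponding to [fmin, fmax]."""
    nfft = 1 << (len(frame) - 1).bit_length()        # next power of two
    log_mag = np.log(np.abs(np.fft.fft(frame, nfft)) + 1e-12)
    c = np.fft.ifft(log_mag).real                    # real cepstrum
    q_lo = int(fs / fmax)                            # shortest period
    q_hi = int(fs / fmin)                            # longest period
    q = q_lo + int(np.argmax(c[q_lo:q_hi]))
    return fs / q
```

On a 100 Hz impulse train sampled at 10 kHz, the dominant cepstral peak falls at a quefrency of 100 samples, i.e. 100 Hz.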