Transcript Slide 1
Pitch Prediction for Glottal Spectrum Estimation with Applications in Speaker Recognition
Nengheng Zheng
Supervised by Professor P.C. Ching
Nov. 26, 2004

Outline
• Speech production and glottal pulse excitation in detail
• Linear prediction: short-term and long-term
• Glottal spectrum estimated with long-term prediction and acoustic features
• Implementation for speaker recognition

Speech Production
Discrete time model for speech production
[Block diagram: impulse train generator, glottal pulse model G(z), gain A_V; random noise generator, gain A_N; excitation u(n); vocal tract model V(z); radiation model R(z); output s(n)]
A combined transfer function:
$H(z) = G(z)\,V(z)\,R(z)$
[Figure panels: glottal pulses, vocal tract, speech signal]

Acoustic Features of Glottal Pulse
• Time domain
– pitch period
– pitch period perturbation (jitter)
– pulse amplitude perturbation (shimmer)
– glottal pulse width
– abruptness of closure of the glottal flow
– aspiration noise
• Frequency domain
– fundamental frequency (F0)
– spectral tilt (slope)
– harmonic richness

Glottal Pulse and Voice Quality
• Glottal pulse shape plays an important role in the quality of natural or synthesized vowels [Rosenberg 1971]
– The shape and periodicity of vocal cord excitation are subject to large variation
– Such variations are significant for preserving speech naturalness
– A typical glottal pulse: asymmetric with a shorter falling phase; spectrum with -12 dB/octave decay
• More variation among different speakers than among different utterances of the same speaker [Mathews 1963]
• Such variations have little significance for speech intelligibility but affect the perceived vocal quality [Childers 1991]

Various Glottal Pulses
• Some other vocal types: breathy, falsetto, vocal fry
• Temporal and spectral characteristics

Some Comments
• Generally, to study the glottal pulse characteristics, it is necessary to rebuild the glottal pulse waveform by inverse filtering
• Automatically and exactly rebuilding the glottal waveform from real speech is almost impossible, especially at the transient phase of articulation or for high-pitched speakers
• Fortunately, it is possible to estimate the glottal spectrum from the residual signal with pitch prediction

Linear Prediction
• Speech waveform: current and past samples are correlated, so the signal is predictable
• Short-term correlation: $s(n) \approx \sum_{k=1}^{P} a_k\, s(n-k)$
– Occurs within one pitch period
– Formant modulation
– Classical linear prediction analysis (short-term prediction)
• Long-term correlation: $u(n) \approx b\, u(n-p)$
– Occurs across consecutive pitch periods
– Vocal cord vibration
– Long-term/pitch prediction

Linear Prediction
• Short-term predictor <classical linear prediction>
$A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$
– Removes the short-term correlation and results in a glottal excitation signal
$u(n) = s(n) - \sum_{k=1}^{P} a_k\, s(n-k)$
• Long-term predictor <pitch prediction>
$P(z) = 1 - b_{-1} z^{-(p-1)} - b_0 z^{-p} - b_1 z^{-(p+1)}$
– Removes the correlation across consecutive periods
$v(n) = u(n) - \sum_{k=-1}^{1} b_k\, u(n-p-k)$
[Block diagram: s(n) minus the short-term predictor output $\sum_{i=1}^{M} a_i z^{-i}$ gives u(n); u(n) minus the long-term predictor output $\sum_{k=-1}^{1} b_k z^{-(p+k)}$ gives v(n)]

Linear Prediction: An example
[Figure: waveforms of s(n), u(n), and v(n), with their spectra; axes: intensity (dB) versus frequency (Hz), and power spectrum magnitude (dB) versus frequency]

Examples of pitch prediction estimated glottal spectrum
[Figure: four example panels of glottal spectra estimated by pitch prediction]

Harmonic Structure of Glottal Spectrum
• Two parameters describing the harmonic structure
– Harmonic richness factor and noise-to-harmonic ratio
• Harmonic richness factor (HRF)
$\mathrm{HRF}_n = 10 \log \dfrac{\sum_{H_i \in B_n} H_i}{H_1}$
• Noise-to-harmonic ratio (NHR)
$\mathrm{NHR}_n = 10 \log \dfrac{\sum_{N_i \in B_n} N_i}{\sum_{H_i \in B_n} H_i}$
[Figure: spectrum with harmonic peaks H_i and noise levels N_i (marked o)]

Feature Generation
• Acoustic features including the following:
– Fundamental frequency F0
– Pitch prediction gain g: $g = 10 \log \dfrac{\sum_n u^2(n)}{\sum_n v^2(n)}$
– Pitch prediction coefficients b_{-1}, b_0, b_1
– HRF_n and NHR_n <n = 1:10>
• 10 Mel-scale frequency bands
• Feature generation process
[Flow diagram: s(n) → S-T prediction → u(n) → L-T prediction on every pitch period → p, g, b_i; G(z) → G(f) → Mel-scale band-pass filtering → HRF_n, NHR_n, n = 1, 2, …]

Experimental Conditions
• Speech quality: telephone speech
• Subjects: 49 male speakers
• Training condition:
– 3 training sessions, about 90 s of speech in total, over 3~6 weeks
– 128 GMM
• Testing condition:
– 12 testing sessions over 4~6 months

Speaker recognition experiments
• Identification results with long-term prediction related features

Feature       F0    g     [b-1 b0 b1]   HRF   NHR
Iden. rate    18%   11%   14%           32%   17%

• Comparison of glottal source feature with classical features

Features                 Identification error rate (%)
Fgs: F0_g_HRF_NHR25      52
LPCC_D_A36               2.84
LPCC_D_A+Fgs             2.26
MFCC_D_A                 2.1
MFCC_D_A+Fgs             1.9

Summary
• Glottal source excitation is important for the perceptual naturalness of voice quality and helps distinguish one speaker from another.
• Linear prediction is a powerful tool for speech analysis. The spectral property of the supraglottal vocal tract system can be estimated by short-term prediction, while long-term prediction estimates the spectrum of the glottal excitation system.
• Recognition results show that the glottal source related acoustic features (F0, prediction gain, HRF, NHR, etc.) provide a certain degree of speaker discriminative power.
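The two-stage prediction described in the slides (short-term prediction giving u(n), then a 3-tap long-term predictor giving v(n) and the pitch prediction gain g) can be sketched roughly as follows. This is only an illustrative sketch, not the implementation used in the talk: the autocorrelation-method LPC solve, the least-squares fit of b_{-1}, b_0, b_1, the autocorrelation-peak lag search, the toy two-pole "vocal tract", and all function names are assumptions made for the example.

```python
import numpy as np

def short_term_residual(s, order):
    """Autocorrelation-method LPC (an assumed estimator).
    Returns (a, u) with u(n) = s(n) - sum_{k=1..order} a_k s(n-k)."""
    s = np.asarray(s, dtype=float)
    r = np.correlate(s, s, mode="full")[len(s) - 1 : len(s) + order]  # r[0..order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # tiny ridge for numerical stability (assumption)
    a = np.linalg.solve(R, r[1 : order + 1])
    u = s.copy()
    for k in range(1, order + 1):
        u[k:] -= a[k - 1] * s[:-k]
    return a, u

def estimate_lag(u, lo, hi):
    """Hypothetical helper: pick the pitch lag p as the autocorrelation peak in [lo, hi]."""
    r = [np.dot(u[lag:], u[:-lag]) for lag in range(lo, hi + 1)]
    return lo + int(np.argmax(r))

def long_term_residual(u, p):
    """3-tap pitch predictor: v(n) = u(n) - sum_{k=-1..1} b_k u(n-p-k),
    with b fitted by least squares (an assumed fitting method)."""
    n = np.arange(p + 1, len(u))
    # Columns are u(n-p+1), u(n-p), u(n-p-1), i.e. k = -1, 0, 1.
    X = np.stack([u[n - p - k] for k in (-1, 0, 1)], axis=1)
    b, *_ = np.linalg.lstsq(X, u[n], rcond=None)
    v = u[n] - X @ b
    return b, v, u[n]

def prediction_gain_db(u, v):
    """Pitch prediction gain g = 10 log10( sum u^2 / sum v^2 )."""
    return 10.0 * np.log10(np.sum(u ** 2) / np.sum(v ** 2))

def synth_voiced(n_samples=2000, period=80, seed=0):
    """Toy voiced signal: noisy impulse train through a stable two-pole resonator."""
    rng = np.random.default_rng(seed)
    e = 0.01 * rng.standard_normal(n_samples)
    e[::period] += 1.0
    s = np.zeros(n_samples)
    for n in range(n_samples):
        s[n] = e[n]
        if n >= 1:
            s[n] += 1.6 * s[n - 1]
        if n >= 2:
            s[n] -= 0.81 * s[n - 2]
    return s

if __name__ == "__main__":
    s = synth_voiced()
    a, u = short_term_residual(s, order=4)
    p = estimate_lag(u, 20, 200)
    b, v, u_seg = long_term_residual(u, p)
    print("lag p:", p)
    print("b_-1, b_0, b_1:", b)
    print("gain g (dB):", prediction_gain_db(u_seg, v))
```

On the toy signal the residual u(n) is close to the impulse train, so the centre tap b_0 dominates and the gain is clearly positive. Real speech would need frame-wise analysis and a more careful pitch estimate than this simple peak search.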

Other Applications
• Speech coding
• Speech recognition ?
• Speaking emotion recognition !

Thank You!