Speech processing
NNSU Lab Summer 2003 Seminar

Intel® Integrated Performance Primitives vs. Speech Libraries & Toolkits:
Math Inside & Outside

by Vitaly Horban <[email protected]>
Agenda

Comparison of Intel® IPP 3.0 and speech libraries & toolkits
Overview of mathematical methods for speech processing
General assessment of Intel® IPP 3.0
Summary
Acronyms

LP     Linear Prediction
RELP   Residual-Excited Linear Prediction
PLP    Perceptual Linear Prediction
AR     Area Ratios or Autoregressive
LSP    Line Spectrum Pairs
LSF    Line Spectral Frequencies
MFCC   Mel-Frequency Cepstrum Coefficients
MLSA   Mel Log Spectral Approximation
DCT    Discrete Cosine Transform
DTW    Dynamic Time Warping
SVD    Singular Value Decomposition
VQ     Vector Quantization
RFC    Rise/Fall/Connection
HMM    Hidden Markov Model
ANN    Artificial Neural Network
EM     Expectation/Maximization
Acronyms (continued)

CMS    Cepstral Mean Subtraction
MLP    Multi-Layer Perceptron
LDA    Linear Discriminant Analysis
QDA    Quadratic Discriminant Analysis
NLDA   Non-Linear Discriminant Analysis
SVM    Support Vector Machine
DWT    Discrete Wavelet Transform
LAR    Log Area Ratio
PLAR   Pseudo Log Area Ratio
GMM    Gaussian Mixture Model
WFST   Weighted Finite State Transducer
CART   Classification and Regression Trees
HNM    Harmonic plus Noise Modeling
MBR    Minimum Bayes Risk
SR     Speech Recognition
TTS    Text-To-Speech synthesis
IPP vs. CMU Sphinx

Intel IPP 3.0:
  Feature processing:
    LP
    Power spectrum
    Cepstrum
    LSP
    Mel-scale values
    Mel-frequency filter bank
    Mel-cepstrum
    Linear-scale values
  Acoustic & language models:
    Gaussian mixture
    Likelihood of an HMM state cluster
    HMM transition matrix

CMU Sphinx:
  Feature processing:
    LP
    Spectrum
    Cepstrum
    MEL: filter, cepstrum, filter bank
    PLP: filter, cepstrum, filter bank
  Language model:
    Context-free grammar
    N-gram model
  Acoustic model based on HMM:
    Each HMM state is a set of Gaussian mixtures
    HMM order
    HMM position
    HMM transition matrix
    Baum-Welch training
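Both columns lean on the mel frequency scale. As an illustrative sketch only (not IPP's or Sphinx's API), the commonly used Hz-to-mel mapping and the centre frequencies of a hypothetical 5-band mel filter bank look like this in Python:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centre frequencies of a hypothetical 5-band mel filter bank over 0-4000 Hz:
# spaced uniformly in mel, then mapped back to Hz.
lo, hi, n = hz_to_mel(0.0), hz_to_mel(4000.0), 5
centres = [mel_to_hz(lo + (hi - lo) * (i + 1) / (n + 1)) for i in range(n)]
```

The centres come out uniformly spaced on the mel axis but increasingly far apart in Hz, which is what gives mel-cepstral features their perceptually motivated frequency resolution.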
IPP vs. CSLU Toolkit

Intel IPP 3.0:
  Feature processing:
    Power spectral analysis (FFT)
    Linear predictive analysis (LPC)
    LP reflection coefficients
    LSP
    DCT
    RFC
    Cross-correlation coefficients
    Covariance matrix
    Mel-scale cepstral analysis
    Derivative functions
    Energy normalization
  Acoustic & language model:
    VQ
    Weights, means and variances
    EM re-estimation
    Viterbi decoding

CSLU Toolkit:
  Feature processing:
    Power spectral analysis (FFT)
    Linear predictive analysis (LPC)
    PLP
    Mel-scale cepstral analysis (MEL)
    Relative spectra filtering of log domain coefficients (RASTA)
    First-order derivative (DELTA)
    Energy normalization
  Language model:
    Word pronunciation
    Lexical trees
    Grammars
  Acoustic model based on HMM/ANN:
    VQ initialisation
    EM training
    Viterbi decoding
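Viterbi decoding appears on both sides of this comparison. A minimal sketch of the algorithm for a discrete-emission HMM, with toy probabilities that are purely hypothetical:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for a discrete-emission HMM."""
    # V[t][s] = (best path probability ending in s at time t, predecessor state)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        layer = {}
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p][0] * trans_p[p][s])
            layer[s] = (V[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]],
                        prev)
        V.append(layer)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    path.reverse()
    return path

# Toy two-state example (hypothetical numbers, for illustration only):
states = ('rain', 'sun')
start = {'rain': 0.9, 'sun': 0.1}
trans = {'rain': {'rain': 0.7, 'sun': 0.3}, 'sun': {'rain': 0.3, 'sun': 0.7}}
emit = {'rain': {0: 0.9, 1: 0.1}, 'sun': {0: 0.1, 1: 0.9}}
path = viterbi([0, 1], states, start, trans, emit)
```

A production decoder works in the log domain and prunes hypotheses, but the dynamic-programming recursion is the same.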
IPP vs. Festival

Intel IPP 3.0:
  Feature processing:
    Power spectrum
    LPC
    MEL
    LSF
    LP reflection coefficients
    Energy normalization
  Acoustic & language model:
    Viterbi decoding

Festival:
  Feature processing:
    Power spectrum
    Reflection to Tilt, PitchmarkToF0, Unit Curve (RFC)
    Tilt to RFC, RFC to Tilt, RFC to F0
    LPC
    MEL
    LSF
    LP reflection coefficients
    Fundamental frequency (pitch)
    Root mean square energy
  Language model:
    N-gram model
    Context-free grammar
    WFST
  Acoustic model:
    Regular expressions
    CART trees
    Viterbi decoding
IPP vs. ISIP

Intel IPP 3.0:
  Feature processing:
    Derivative functions
    Spectrum
    Cepstrum
    Cross-correlation
    Covariance matrix
    Energy normalization
    Filter bank
    Area ratio
    Durbin's recursion
    Reflection coefficients (Schur)
    Gaussian probability
  Acoustic & language model:
    Viterbi decoding

ISIP:
  Feature processing:
    Derivative functions
    Spectrum
    Cepstrum
    Cross-correlation
    Covariance matrix
    Covariance (Cholesky)
    Energy (log, dB, RMS, power)
    Filter bank
    Log area ratio (Kelly-Lochbaum)
    Autocorrelation (Durbin recursion, LeRoux-Gueguen)
    Lattice (Burg)
    Reflection coefficients
    Gaussian probability
  Acoustic & language model (HMM):
    N-gram model
    Viterbi decoding
    Baum-Welch training
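Durbin's recursion, listed in both columns, converts autocorrelation lags into LP and reflection coefficients. An illustrative Python sketch (not ISIP or IPP code):

```python
def levinson_durbin(r):
    """Levinson-Durbin recursion: solve the LP normal equations from
    autocorrelation lags r[0..p].  Returns the prediction polynomial a
    (with a[0] = 1), the reflection coefficients k, and the residual energy."""
    p = len(r) - 1
    a = [1.0] + [0.0] * p
    k = []
    err = r[0]
    for i in range(1, p + 1):
        # Correlation of the current predictor with the next lag.
        acc = sum(a[j] * r[i - j] for j in range(i))
        ki = -acc / err
        k.append(ki)
        # Order update: a_new[j] = a[j] + ki * a[i - j]
        a = [a[j] + ki * a[i - j] for j in range(i + 1)] + a[i + 1:]
        err *= (1.0 - ki * ki)
    return a, k, err

# Hypothetical AR(1)-like autocorrelation with lag decay 0.5:
a, k, err = levinson_durbin([1.0, 0.5, 0.25])
```

For this toy input the recursion recovers a first-order predictor (a[1] = -0.5, a[2] = 0), matching the fact that the lags decay geometrically.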
IPP vs. MATLAB

Intel IPP 3.0:
  Frequency scale conversion:
    Mel scale
    Linear scale
  Transforms:
    DFT
    FFT
    DCT
  Distance:
    Euclidean
    Mahalanobis
    DTW (observation and reference vector sequences)
    Bhattacharyya

MATLAB:
  Frequency scale conversion:
    Mel scale
    Equivalent rectangular bandwidths (ERB)
  Transforms:
    FFT (real data)
    DCT (real data)
    Hartley (real data)
    Diagonalisation of two Hermitian matrices (LDA, IMELDA)
  Vector distance:
    Euclidean
    Squared Euclidean
    Mahalanobis
    Itakura (AR, power spectra)
    Itakura-Saito (AR, power spectra)
    COSH (AR, power spectra)
  Speech enhancement:
    Martin spectral subtraction algorithm
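DTW, listed under the IPP distances, aligns an observation sequence against a reference of possibly different length. A minimal sketch for scalar sequences (the vector case substitutes a vector distance for the absolute difference):

```python
def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Dynamic time warping cost between two sequences x and y."""
    n, m = len(x), len(y)
    INF = float('inf')
    # D[i][j] = minimum accumulated cost aligning x[:i] with y[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Step pattern: match, insertion, or deletion.
            D[i][j] = dist(x[i - 1], y[j - 1]) + min(D[i - 1][j],
                                                     D[i][j - 1],
                                                     D[i - 1][j - 1])
    return D[n][m]

# A stretched copy of the reference aligns with zero cost:
cost = dtw([1.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```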
IPP vs. MATLAB (continued)

Intel IPP 3.0:
  Feature processing:
    LPC
    Area ratio
    Spectrum
    Cepstrum
    RFC
    DCT
    LSP
    LSF
    Reflection coefficients
    Autocorrelation coefficients
    Cross-correlation coefficients
    Covariance matrix
    Mel-scale cepstral analysis
    Derivative functions
    Energy normalization

MATLAB:
  LPC analysis and transforms:
    Area ratios
    Autoregressive (AR)
    Power spectrum
    Cepstrum
    DCT
    Impulse response (IR)
    LSP
    LSF
    Reflection coefficients
    Unit-triangular matrix containing the AR coefficients
    Autocorrelation coefficients
    Expand formant bandwidths of LPC filter
    Warp cepstral (Mel, linear)
IPP vs. MATLAB (continued)

Intel IPP 3.0:
  Speech recognition:
    Feature processing
    Model evaluation
    Model estimation
    Model adaptation
    Vector quantization
  Speech coding (ITU G.711, G.723.1, G.729):
    Linear PCM
    A-law
    Mu-law
    VQ with a given codebook

MATLAB:
  Speech synthesis:
    Rosenberg glottal model
    Liljencrants-Fant glottal model
  Speech recognition:
    Mel-cepstrum
    Mel-filter bank
    Cepstral means & variances to power domain
    Gaussian mixture
  Speech coding (ITU G.711):
    Linear PCM
    A-law
    Mu-law
    VQ using the K-means algorithm
    VQ using the Linde-Buzo-Gray algorithm
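The K-means codebook training listed for MATLAB can be sketched for scalar data as below; the Linde-Buzo-Gray algorithm wraps the same refinement step in a codebook-splitting loop. Illustrative only, with hypothetical data:

```python
def kmeans_codebook(data, codebook, iters=20):
    """Lloyd (K-means) refinement of a scalar VQ codebook, the core step
    of the generalised Lloyd / Linde-Buzo-Gray family of algorithms."""
    for _ in range(iters):
        # Nearest-neighbour assignment of samples to codewords.
        cells = [[] for _ in codebook]
        for x in data:
            j = min(range(len(codebook)), key=lambda j: (x - codebook[j]) ** 2)
            cells[j].append(x)
        # Centroid update: move each codeword to the mean of its cell;
        # keep an empty cell's codeword unchanged.
        codebook = [sum(c) / len(c) if c else codebook[j]
                    for j, c in enumerate(cells)]
    return codebook

# Two well-separated clusters (hypothetical samples):
cb = kmeans_codebook([0.0, 0.1, -0.1, 9.9, 10.0, 10.1], [0.5, 9.5])
```

Each iteration can only lower (or keep) the mean quantization error, so the codebook converges to a local optimum.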
IPP vs. HTK

Intel IPP 3.0:
  Feature processing:
    LPC
    Area ratio
    Spectrum
    Cepstrum
    RFC
    DCT
    LSP
    LSF
    Reflection coefficients
    Autocorrelation coefficients
    Cross-correlation coefficients
    Covariance matrix
    Mel-scale cepstral analysis
    Derivative functions
    Energy normalization
    VQ

HTK:
  Feature processing:
    LPC
    Spectral coefficients
    Cepstral coefficients
    Reflection coefficients
    Gaussian distribution
    K-means procedure
    PLP
    Autocorrelation coefficients
    Covariance matrix
    Mel-scale filter bank
    MFCC
    Third differential
    Energy
    VQ codebook
IPP vs. HTK (continued)

Intel IPP 3.0:
  Model adaptation:
    EM training algorithm
  Acoustic & language model:
    Viterbi decoding
    Likelihood of an HMM state cluster
    HMM transition matrix
  Speech coding (ITU G.711, G.723.1, G.729):
    Linear PCM
    A-law
    Mu-law
    VQ with a given codebook

HTK:
  Model adaptation:
    Maximum Likelihood Linear Regression (MLLR)
    EM technique
    Bayesian adaptation, or maximum a posteriori approach (MAP)
  Acoustic & language model based on HMM:
    Grammar
    N-gram model
    Viterbi training
    Baum-Welch training
  Speech coding:
    Linear PCM
    A-law
    Mu-law
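Mu-law companding, listed under speech coding on both sides, compresses the dynamic range before quantization so that quiet samples keep more resolution. The sketch below uses the continuous companding formula; real ITU-T G.711 additionally quantizes to 8-bit codewords with a segmented approximation of this curve:

```python
import math

MU = 255.0  # mu parameter used by North American / Japanese G.711

def mu_law_compress(x: float) -> float:
    """Continuous mu-law companding of a sample in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Inverse companding: map a companded value back to the linear domain."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Small amplitudes are boosted by the compressor, large ones compressed:
y = mu_law_compress(0.1)
```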
Possible extensions to IPP 3.0

Feature processing:
  PLP: filter, cepstrum, filter bank
  Relative spectra filtering of log domain coefficients (RASTA)
  Fundamental frequency (pitch)
  RMS energy
  Covariance (Cholesky)
  Energy (log, dB, RMS, power)
  LAR (Kelly-Lochbaum)
  Autocorrelation (LeRoux-Gueguen)
  Lattice (Burg)
  Equivalent rectangular bandwidths (ERB)
  Unit-triangular matrix (AR coefficients)
  Expand formant bandwidths (LP)
  Third differential
  Hartley transform
  Diagonalisation of two Hermitian matrices (LDA, IMELDA)
Model adaptation:
  Maximum Likelihood Linear Regression (MLLR)
  Bayesian adaptation, or maximum a posteriori approach (MAP)
Model evaluation:
  Itakura (AR, power spectra)
  Itakura-Saito (AR, power spectra)
  COSH (AR, power spectra)
Speech synthesis:
  Rosenberg glottal model
  Liljencrants-Fant glottal model
Speech enhancement:
  Martin spectral subtraction
Speech coding:
  VQ using the K-means algorithm
  VQ using the Linde-Buzo-Gray algorithm
Acoustic model based on HMM:
  Baum-Welch training
Speaker Characteristics

Feature processing:
  Pre-emphasis
  Cepstral energy
  Cepstral Mean Subtraction (CMS)
  MFCC, LPCC, LFCC
  LPC (to cepstral, to LSF)
  Residual prediction
  Mel-cepstral
  Fundamental frequency (F0)
  LSF (Bark scale)
  RMS energy
  Levinson-Durbin recursion
  Covariance (Cholesky)
  Delta cepstral (Milner, high order)
  Pseudo Log Area Ratio (PLAR)
  DWT
  VQ
Acoustic model:
  Distance:
    Bhattacharyya
    DTW
    Euclidean
  Viterbi decoding
  EM (Lloyd)
  K-means (Lloyd)
  PLP
  MLP
  Twin-output MLP
  LDA
  NLDA
  Generative models:
    GMM
    HMM (Baum-Welch)
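Two of the simpler front-end steps listed above, pre-emphasis and cepstral mean subtraction, can be sketched as:

```python
def pre_emphasize(x, alpha=0.97):
    """First-order high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1].
    Flattens the spectral tilt of voiced speech before LP/cepstral analysis."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def cepstral_mean_subtraction(frames):
    """CMS: subtract the per-dimension mean cepstrum across all frames,
    removing stationary convolutional (channel) effects."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

# With alpha = 1 a constant (DC) signal is cancelled entirely:
emph = pre_emphasize([1.0, 1.0, 1.0], alpha=1.0)
# CMS leaves each coefficient zero-mean over the utterance:
cms = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 4.0]])
```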
Speech Processing

Feature processing:
  LPC
  LSP
  F0
  Levinson-Durbin recursion
  Tilt
  Gaussian
Acoustic & language model:
  Baum-Welch training
  Viterbi decoding
  CART
  Statistical language modeling
Speech enhancement and speech analysis:
  Discrete Wigner distribution
  DWT
  Pitch determination
  Code Excited Linear Predictor (CELP)
Speech Recognition

Feature processing:
  DCT
  MFCC
  Mel-frequency log energy coefficients (MFLEC)
  Subband MFCC (SB-MFCC)
  CMS
  Within-vector filtered MFCC (WVF-MFCC)
  Robust formant (RF) algorithm
  Split Levinson algorithm (SLA)
Vector quantization:
  VQ correlation
  Single VQ
  Joint VQ
Acoustic & language model:
  Viterbi decoding
  LDA
  QDA
  MLP
  PLP
  EM re-estimation
  Minimum Bayes Risk (MBR)
  Maximum Likelihood Estimation (MLE)
  NN (Elman predictive)
  HMM (Baum-Welch)
  GMM
  Buried Markov Model
  Decision tree state clustering
  WFST
  Dynamic Bayesian networks
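Maximum likelihood estimation for HMMs (Baum-Welch) is built on the forward algorithm, which sums over all state paths to obtain the likelihood of an observation sequence. An illustrative sketch with a trivial one-state model (hypothetical numbers):

```python
def forward_likelihood(obs, states, start_p, trans_p, emit_p):
    """Total probability of an observation sequence under a discrete HMM,
    computed with the forward recursion (no path enumeration needed)."""
    # alpha[s] = probability of the prefix seen so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[p] * trans_p[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

# One state emitting 0 or 1 with probability 0.5 each: any length-3
# sequence has likelihood 0.5 ** 3 = 0.125.
like = forward_likelihood([0, 1, 0], ('a',), {'a': 1.0},
                          {'a': {'a': 1.0}}, {'a': {0: 0.5, 1: 0.5}})
```

Baum-Welch pairs this with a backward pass to get state occupancy posteriors, then re-estimates the transition and emission parameters from them.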
Speech Synthesis

Feature processing:
  MFCC
  Log Area Ratio (LAR)
  Bark frequency scale
  FFT
  Power spectrum
  LPC
  LSF
  F0
  Likelihood ratio
  Residual LP
  Mel Log Spectral Approximation (MLSA), MLSA filter
  Covariance
  Energy
  Delta, delta-delta
Acoustic & language model:
  Viterbi decoding
  HMM (Baum-Welch)
  EM training
  WFST
  CART
  Harmonic plus Noise Modeling (HNM)
Distance:
  Euclidean
  Kullback-Leibler
  Mean squared log spectral distance (MS-LSD)
  Mahalanobis
  Itakura-Saito
  Symmetrised Itakura
  RMS (root mean squared log spectral)
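The Itakura-Saito divergence and the RMS log-spectral distance listed above compare two power spectra bin by bin. An illustrative sketch (per-bin averaged forms; exact conventions vary across toolkits):

```python
import math

def itakura_saito(p, q):
    """Itakura-Saito divergence between power spectra p and q (per-bin mean).
    Asymmetric: d(p, q) != d(q, p) in general."""
    return sum(pi / qi - math.log(pi / qi) - 1.0
               for pi, qi in zip(p, q)) / len(p)

def log_spectral_distance(p, q):
    """Root mean squared log-spectral distance between p and q, in dB."""
    return math.sqrt(sum((10.0 * math.log10(pi / qi)) ** 2
                         for pi, qi in zip(p, q)) / len(p))

# Identical spectra are at distance zero; swapping the arguments of the
# Itakura-Saito divergence changes its value.
d_is = itakura_saito([1.0, 2.0], [2.0, 1.0])
```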
New Speech Functionality

Feature processing:
  Bark scale
  Fundamental frequency
  Likelihood ratio
  Covariance (Cholesky)
  MLSA
  CMS
  SB-MFCC
  WVF-MFCC
  Robust formant algorithm
  Split Levinson algorithm
  LPCC
  LFCC
  RMS energy
  Delta cepstral (Milner, high order)
  Pseudo LAR
  PLP
Acoustic & language model:
  HMM (Baum-Welch)
  HNM
  MLP
  WFST
  CART
  LDA, NLDA, QDA
  Minimum Bayes Risk (MBR)
  Maximum Likelihood Estimation
  NN (Elman predictive)
  Discrete Wigner distribution
  Code Excited Linear Predictor
Distance:
  Kullback-Leibler
  Mean squared log spectral distance (MS-LSD)
  Itakura-Saito
  Symmetrised Itakura
  RMS
Summary

Intel® IPP 3.0 now covers the most useful primitives for speech processing.
Speech-enabled applications still require more primitives.
Developers and researchers need more samples.

Thank You!