Speech Processing: NNSU Lab Summer 2003 Seminar

Intel® Integrated Performance Primitives vs. Speech Libraries & Toolkits: Math Inside & Outside

by Vitaly Horban, [email protected]

Agenda

- Comparison of Intel® IPP 3.0 with speech libraries & toolkits
- Overview of mathematical methods for speech processing
- General assessment of Intel® IPP 3.0
- Summary

               

Acronyms

- LP: Linear Prediction
- RELP: Residual Linear Prediction
- PLP: Perceptual Linear Prediction
- AR: Area Ratios or Autoregressive
- LSP: Line Spectrum Pairs
- LSF: Line Spectral Frequencies
- MFCC: Mel-Frequency Cepstrum Coefficients
- MLSA: Mel Log Spectral Approximation
- DCT: Discrete Cosine Transform
- DTW: Dynamic Time Warping
- SVD: Singular Value Decomposition
- VQ: Vector Quantization
- RFC: Rise/Fall/Connection
- HMM: Hidden Markov Model
- ANN: Artificial Neural Network
- EM: Expectation/Maximization

               

Acronyms (continued)

- CMS: Cepstral Mean Subtraction
- MLP: Multi-Layer Perceptron
- LDA: Linear Discriminant Analysis
- QDA: Quadratic Discriminant Analysis
- NLDA: Non-Linear Discriminant Analysis
- SVM: Support Vector Machine
- DWT: Discrete Wavelet Transform
- LAR: Log Area Ratio
- PLAR: Pseudo Log Area Ratio
- GMM: Gaussian Mixture Model
- WFST: Weighted Finite State Transducer
- CART: Classification and Regression Trees
- HNM: Harmonic plus Noise Modeling
- MBR: Minimum Bayes Risk
- SR: Speech Recognition
- TTS: Text-To-Speech synthesis

 

IPP vs. CMU Sphinx

Intel IPP 3.0:
- Feature processing: LP; Power spectrum; Cepstrum; LSP; Mel-scale values; Mel-frequency filter bank; Mel-cepstrum (sketch below); Linear-scale values
- Acoustic & language models: Gaussian mixture; Likelihood of an HMM state cluster; HMM transition matrix

CMU Sphinx:
- Feature processing: LP; Spectrum; Cepstrum; MEL (filter, cepstrum, filter bank); PLP (filter, cepstrum, filter bank)
- Language model: Context-free grammar; N-gram model
- Acoustic model based on HMM: Each HMM state is a set of Gaussian mixtures; HMM order; HMM position; HMM transition matrix; Baum-Welch training
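The mel-cepstrum pipeline (power spectrum → mel filter bank → log → DCT) appears in both columns. As a hedged illustration of how these primitives compose, here is a minimal plain-C sketch; it is not the IPP or Sphinx API, all names are ours, and it assumes the common 2595·log10(1 + f/700) mel mapping:

```c
#include <math.h>
#include <stdlib.h>

static const double PI = 3.14159265358979323846;

/* Hertz <-> mel conversion (common 2595*log10(1 + f/700) variant). */
static double hz_to_mel(double hz)  { return 2595.0 * log10(1.0 + hz / 700.0); }
static double mel_to_hz(double mel) { return 700.0 * (pow(10.0, mel / 2595.0) - 1.0); }

/* Apply a triangular mel filter bank to a one-sided power spectrum
   power[0..n_bins-1], then take the DCT-II of the log filter energies
   to obtain n_ceps mel-cepstral coefficients. */
void mel_cepstrum(const double *power, int n_bins, double fs,
                  int n_filters, int n_ceps, double *ceps)
{
    double mel_lo = hz_to_mel(0.0), mel_hi = hz_to_mel(fs / 2.0);
    double *logE = malloc(n_filters * sizeof *logE);

    for (int m = 0; m < n_filters; m++) {
        /* Filter m spans three equally spaced points on the mel axis. */
        double left   = mel_to_hz(mel_lo + (mel_hi - mel_lo) *  m      / (n_filters + 1));
        double center = mel_to_hz(mel_lo + (mel_hi - mel_lo) * (m + 1) / (n_filters + 1));
        double right  = mel_to_hz(mel_lo + (mel_hi - mel_lo) * (m + 2) / (n_filters + 1));
        double e = 0.0;
        for (int k = 0; k < n_bins; k++) {
            double f = (fs / 2.0) * k / (n_bins - 1);
            double w = 0.0;
            if (f > left && f <= center)      w = (f - left) / (center - left);
            else if (f > center && f < right) w = (right - f) / (right - center);
            e += w * power[k];
        }
        logE[m] = log(e + 1e-10);   /* small floor avoids log(0) */
    }
    for (int i = 0; i < n_ceps; i++) {   /* DCT-II of the log energies */
        double s = 0.0;
        for (int m = 0; m < n_filters; m++)
            s += logE[m] * cos(PI * i * (m + 0.5) / n_filters);
        ceps[i] = s;
    }
    free(logE);
}
```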

 

IPP vs. CSLU Toolkit

Intel IPP 3.0:
- Feature processing: Power spectral analysis (FFT); Linear predictive analysis (LPC); LP reflection coefficients; LSP; DCT; RFC; Cross-correlation coefficients; Covariance matrix; Mel-scale cepstral analysis; Derivative functions; Energy normalization
- Acoustic & language model: VQ; Weights, means and variances; EM re-estimation; Viterbi decoding

CSLU Toolkit:
- Feature processing: Power spectral analysis (FFT); Linear predictive analysis (LPC); PLP; Mel-scale cepstral analysis (MEL); Relative spectra filtering of log-domain coefficients (RASTA); First-order derivative (DELTA; sketch below); Energy normalization
- Language model: Word pronunciation; Lexical trees; Grammars
- Acoustic model based on HMM/ANN: VQ initialisation; EM training; Viterbi decoding
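The DELTA (first-order derivative) features on the CSLU side, like IPP's "derivative functions", are usually computed by regressing each cepstral dimension over a few neighboring frames. A minimal hedged sketch (our own helper, not a CSLU or IPP routine), assuming the standard regression window of ±W frames:

```c
/* First-order delta features by linear regression over +/-W frames:
   d[t] = sum_{w=1..W} w * (c[t+w] - c[t-w]) / (2 * sum_{w=1..W} w^2),
   with frames past either end clamped to the first/last frame. */
void delta_features(const double *c, int n_frames, int dim, int W, double *d)
{
    double norm = 0.0;
    for (int w = 1; w <= W; w++) norm += (double)w * w;
    norm *= 2.0;

    for (int t = 0; t < n_frames; t++)
        for (int k = 0; k < dim; k++) {
            double acc = 0.0;
            for (int w = 1; w <= W; w++) {
                int tp = (t + w >= n_frames) ? n_frames - 1 : t + w;
                int tm = (t - w < 0) ? 0 : t - w;
                acc += w * (c[tp * dim + k] - c[tm * dim + k]);
            }
            d[t * dim + k] = acc / norm;
        }
}
```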

 

IPP vs. Festival

Intel IPP 3.0:
- Feature processing: Power spectrum; Reflection to Tilt; PitchmarkToF0; Unit Curve (RFC); LPC; MEL; LSF; LP reflection coefficients; Energy normalization
- Acoustic & language model: Viterbi decoding

Festival:
- Feature processing: Power spectrum; Tilt to RFC; RFC to Tilt; RFC to F0; LPC; MEL; LSF; LP reflection coefficients; Fundamental frequency (pitch; sketch below); Root mean square energy
- Language model: N-gram model; Context-free grammar; WFST; Regular expressions
- Acoustic model: CART trees; Viterbi decoding
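Fundamental-frequency (pitch) extraction appears on the Festival side and again later among the possible IPP extensions. As a hedged illustration of the basic idea only (not Festival's pitchmark-based method), a minimal autocorrelation F0 estimate for a single frame:

```c
/* Estimate F0 of one frame by locating the autocorrelation peak within
   a plausible pitch-lag range. Returns 0 for frames judged unvoiced.
   The 0.3 periodicity threshold is an arbitrary illustrative choice. */
double estimate_f0(const double *x, int n, double fs,
                   double f0_min, double f0_max)
{
    int lag_min = (int)(fs / f0_max);
    int lag_max = (int)(fs / f0_min);
    if (lag_max >= n) lag_max = n - 1;

    double r0 = 0.0;
    for (int i = 0; i < n; i++) r0 += x[i] * x[i];
    if (r0 <= 0.0) return 0.0;

    int best_lag = 0;
    double best = 0.0;
    for (int lag = lag_min; lag <= lag_max; lag++) {
        double r = 0.0;
        for (int i = 0; i + lag < n; i++) r += x[i] * x[i + lag];
        if (r > best) { best = r; best_lag = lag; }
    }
    if (best_lag == 0 || best / r0 < 0.3) return 0.0;
    return fs / best_lag;
}
```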

 

IPP vs. ISIP

Intel IPP 3.0:
- Feature processing: Derivative functions; Spectrum; Cepstrum; Cross-correlation; Covariance matrix; Energy normalization; Filter bank; Area ratio; Durbin's recursion (sketch below); Reflection coefficients (Schur); Gaussian probability
- Acoustic & language model: Viterbi decoding

ISIP:
- Feature processing: Derivative functions; Spectrum; Cepstrum; Cross-correlation; Covariance matrix; Covariance (Cholesky); Energy (Log, dB, RMS, Power); Filter bank; Log Area Ratio (Kelly-Lochbaum); Autocorrelation (Durbin recursion, LeRoux-Gueguen); Lattice (Burg); Reflection coefficients; Gaussian probability
- Acoustic & language model (HMM): N-gram model; Viterbi decoding; Baum-Welch training
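Both sides solve the LP normal equations with Durbin's (Levinson-Durbin) recursion, which also yields the reflection coefficients listed throughout these slides. A minimal textbook sketch (our own code, not the IPP or ISIP implementation; it assumes an LP order below 64):

```c
/* Levinson-Durbin recursion: from autocorrelations r[0..p], compute LP
   coefficients a[1..p] (predictor x[n] ~ sum_k a[k] x[n-k]) and the
   reflection (PARCOR) coefficients k[1..p]. Returns the final
   prediction-error energy. Assumes p < 64. */
double levinson_durbin(const double *r, int p, double *a, double *k)
{
    double tmp[64];
    double err = r[0];

    for (int i = 1; i <= p; i++) {
        double acc = r[i];
        for (int j = 1; j < i; j++) acc -= a[j] * r[i - j];
        double ki = acc / err;          /* i-th reflection coefficient */
        k[i] = ki;

        for (int j = 1; j < i; j++) tmp[j] = a[j] - ki * a[i - j];
        for (int j = 1; j < i; j++) a[j] = tmp[j];
        a[i] = ki;

        err *= 1.0 - ki * ki;           /* shrink prediction error */
    }
    return err;
}
```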

  

IPP vs. MATLAB

Intel IPP 3.0:
- Frequency scale conversion: Mel scale; Linear scale
- Transforms: DFT; FFT; DCT
- Distance: Euclidean; Mahalanobis; DTW (observation and reference vector sequences); Bhattacharya

MATLAB:
- Frequency scale conversion: Mel scale; Equivalent Rectangular Bandwidths (ERB)
- Transforms: FFT (real data); DCT (real data); Hartley (real data); Diagonalisation of two Hermitian matrices (LDA, IMELDA)
- Vector distance: Euclidean; Squared Euclidean; Mahalanobis; Itakura (AR, power spectra); Itakura-Saito (AR, power spectra); COSH (AR, power spectra; definitions below)
- Speech enhancement: Martin spectral subtraction algorithm
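The AR/power-spectrum distances named above have compact standard definitions. For reference (textbook forms, not taken from the slides): the Itakura-Saito distance between power spectra S(ω) and Ŝ(ω), and the symmetric COSH distance obtained by averaging it in both directions:

```latex
d_{IS}(S,\hat S) \;=\; \frac{1}{2\pi}\int_{-\pi}^{\pi}
  \left[\frac{S(\omega)}{\hat S(\omega)}
        - \ln\frac{S(\omega)}{\hat S(\omega)} - 1\right] d\omega

d_{COSH}(S,\hat S) \;=\; \tfrac{1}{2}\,\bigl[d_{IS}(S,\hat S) + d_{IS}(\hat S,S)\bigr]
  \;=\; \frac{1}{2\pi}\int_{-\pi}^{\pi}
  \left[\cosh\!\left(\ln\frac{S(\omega)}{\hat S(\omega)}\right) - 1\right] d\omega
```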

IPP vs. MATLAB (continued)

Intel IPP 3.0:
- Feature processing: LPC; Area ratio; Spectrum; Cepstrum; RFC; DCT; LSP; LSF; Reflection coefficients; Autocorrelation coefficients; Cross-correlation coefficients; Covariance matrix; Mel-scale cepstral analysis; Derivative functions; Energy normalization

MATLAB:
- LPC analysis and transforms: Area ratios; Autoregressive (AR); Power spectrum; Cepstrum (LPC-to-cepstrum sketch below); DCT; Impulse response (IR); LSP; LSF; Reflection coefficients; Unit-triangular matrix containing the AR coefficients; Autocorrelation coefficients; Expand formant bandwidths of the LPC filter; Warped cepstrum (Mel, Linear)
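One LPC transform both columns depend on is converting LP coefficients into cepstra. A hedged sketch of the standard recursion (our own helper, not a MATLAB or IPP function; sign conventions vary between texts):

```c
/* Convert LP coefficients a[1..p] (predictor x[n] ~ sum_k a[k] x[n-k])
   into cepstral coefficients c[1..n_ceps] via the standard recursion:
     c[n] = a[n] + sum_{k=1..n-1} (k/n) c[k] a[n-k]    for n <= p,
     c[n] =        sum_{k=n-p..n-1} (k/n) c[k] a[n-k]  for n > p.  */
void lpc_to_cepstrum(const double *a, int p, double *c, int n_ceps)
{
    for (int n = 1; n <= n_ceps; n++) {
        double acc = (n <= p) ? a[n] : 0.0;
        int k_lo = (n - p > 1) ? n - p : 1;
        for (int k = k_lo; k < n; k++)
            acc += ((double)k / n) * c[k] * a[n - k];
        c[n] = acc;
    }
}
```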

 

IPP vs. MATLAB (continued)

Intel IPP 3.0:
- Speech recognition: Feature processing; Model evaluation; Model estimation; Model adaptation; Vector quantization
- Speech coding (ITU G.711, G.723.1, G.729): Linear PCM; A-law; Mu-law (companding sketch below); VQ given codebook

MATLAB:
- Speech synthesis: Rosenberg glottal model; Liljencrants-Fant glottal model
- Speech recognition: Mel-cepstrum; Mel filter bank; Cepstra & variances to power domain; Gaussian mixture
- Speech coding (ITU G.711): Linear PCM; A-law; Mu-law
- VQ using the K-means algorithm; VQ using the Linde-Buzo-Gray algorithm
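Both columns include G.711-style companding. As a hedged illustration, the continuous μ-law curve with μ = 255 (the idealized formula behind G.711's segmented 8-bit encoder, not the bit-exact codec):

```c
#include <math.h>

#define MU 255.0   /* the North American G.711 flavor uses mu = 255 */

/* Continuous mu-law compression; input/output normalized to [-1, 1]. */
double mulaw_compress(double x)
{
    double s = (x < 0.0) ? -1.0 : 1.0;
    return s * log(1.0 + MU * fabs(x)) / log(1.0 + MU);
}

/* Inverse (expansion). */
double mulaw_expand(double y)
{
    double s = (y < 0.0) ? -1.0 : 1.0;
    return s * (pow(1.0 + MU, fabs(y)) - 1.0) / MU;
}
```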

IPP vs. HTK

Intel IPP 3.0:
- Feature processing: LPC; Area ratio; Spectrum; Cepstrum; RFC; DCT; LSP; LSF; Reflection coefficients; Autocorrelation coefficients; Cross-correlation coefficients; Covariance matrix; Mel-scale cepstral analysis; Derivative functions; Energy normalization; VQ

HTK:
- Feature processing: LPC; Spectral coefficients; Cepstral coefficients; Reflection coefficients; Gaussian distribution; K-means procedure (sketch below); PLP; Autocorrelation coefficients; Covariance matrix; Mel-scale filter bank; MFCC; Third differential; Energy; VQ codebook
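HTK builds VQ codebooks with a K-means procedure. A compact hedged sketch of plain K-means codebook training (our own illustration, not HTK's implementation); the caller is expected to seed the codebook, e.g. with k distinct training vectors:

```c
#include <float.h>
#include <stdlib.h>
#include <string.h>

/* K-means for VQ codebook training: repeatedly assign each training
   vector to its nearest codeword (squared Euclidean distance), then
   move every codeword to the centroid of its cluster. */
void kmeans_codebook(const double *x, int n, int dim,
                     double *code, int k, int iters)
{
    double *sum = malloc((size_t)k * dim * sizeof *sum);
    int *count = malloc((size_t)k * sizeof *count);

    for (int it = 0; it < iters; it++) {
        memset(sum, 0, (size_t)k * dim * sizeof *sum);
        memset(count, 0, (size_t)k * sizeof *count);

        for (int i = 0; i < n; i++) {
            int best = 0;
            double bestd = DBL_MAX;
            for (int c = 0; c < k; c++) {
                double d = 0.0;
                for (int j = 0; j < dim; j++) {
                    double t = x[i * dim + j] - code[c * dim + j];
                    d += t * t;
                }
                if (d < bestd) { bestd = d; best = c; }
            }
            count[best]++;
            for (int j = 0; j < dim; j++)
                sum[best * dim + j] += x[i * dim + j];
        }
        for (int c = 0; c < k; c++)          /* centroid update */
            if (count[c] > 0)
                for (int j = 0; j < dim; j++)
                    code[c * dim + j] = sum[c * dim + j] / count[c];
    }
    free(sum);
    free(count);
}
```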

  

IPP vs. HTK (continued)

Intel IPP 3.0:
- Model adaptation: EM training algorithm
- Acoustic & language model: Viterbi decoding; Likelihood of an HMM state cluster; HMM transition matrix
- Speech coding (ITU G.711, G.723.1, G.729): Linear PCM; A-law; Mu-law; VQ given codebook

HTK:
- Model adaptation: Maximum Likelihood Linear Regression (MLLR); EM technique; Bayesian adaptation, or Maximum A Posteriori approach (MAP)
- Acoustic & language model based on HMM: Grammar; N-gram model; Viterbi training; Baum-Welch training
- Speech coding: Linear PCM; A-law; Mu-law

Possible extensions to IPP 3.0

- Feature processing: PLP (filter, cepstrum, filter bank); Relative spectra filtering of log-domain coefficients (RASTA); Fundamental frequency (pitch); RMS energy; Covariance (Cholesky); Energy (Log, dB, RMS, Power); LAR (Kelly-Lochbaum); Autocorrelation (LeRoux-Gueguen); Lattice (Burg); Equivalent Rectangular Bandwidths (ERB); Unit-triangular matrix (AR coefficients); Expand formant bandwidths (LP); Third differential; Hartley transform; Diagonalisation of two Hermitian matrices (LDA, IMELDA)
- Model adaptation: Maximum Likelihood Linear Regression (MLLR); Bayesian adaptation, or Maximum A Posteriori approach (MAP)
- Model evaluation: Itakura (AR, power spectra); Itakura-Saito (AR, power spectra); COSH (AR, power spectra)
- Speech synthesis: Rosenberg glottal model; Liljencrants-Fant glottal model
- Speech enhancement: Martin spectral subtraction
- Speech coding: VQ using the K-means algorithm; VQ using the Linde-Buzo-Gray algorithm
- Acoustic model based on HMM: Baum-Welch training

Speaker Characteristics

- Feature processing: Pre-emphasis; Cepstral energy; Cepstral Mean Subtraction (CMS; sketch below); MFCC, LPCC, LFCC; LPC (to cepstrum, to LSF); Residual prediction; Mel-cepstrum; Fundamental frequency (F0); LSF (Bark scale); RMS energy; Levinson-Durbin recursion; Covariance (Cholesky); Delta cepstral (Milner, high-order); Pseudo Log Area Ratio (PLAR); DWT; VQ
- Acoustic model:
  - Distance: Bhattacharya; DTW; Euclidean
  - Viterbi decoding; EM (Lloyd); K-means (Lloyd); PLP; MLP; Twin-output MLP; LDA; NLDA
  - Generative models: GMM; HMM (Baum-Welch)
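Cepstral Mean Subtraction (CMS), listed above, is among the simplest of these primitives: subtracting the per-utterance mean of each cepstral dimension cancels a stationary channel, which shows up as an additive constant in the cepstral domain. A minimal sketch (our own helper):

```c
/* Cepstral Mean Subtraction: remove each dimension's per-utterance
   mean, cancelling stationary convolutional (channel) effects that
   appear as an additive constant in the cepstral domain. */
void cms(double *c, int n_frames, int dim)
{
    for (int k = 0; k < dim; k++) {
        double mean = 0.0;
        for (int t = 0; t < n_frames; t++) mean += c[t * dim + k];
        mean /= n_frames;
        for (int t = 0; t < n_frames; t++) c[t * dim + k] -= mean;
    }
}
```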

 

Speech Processing

- Feature processing: LPC; LSP; F0; Levinson-Durbin recursion; Tilt; Gaussian
- Acoustic & language model: Baum-Welch training; Viterbi decoding; CART; Statistical language modeling
- Speech enhancement and speech analysis: Discrete Wigner Distribution; DWT (Haar sketch below); Pitch determination; Code Excited Linear Predictor (CELP)
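The DWT listed under speech analysis can be illustrated by its simplest instance. A hedged sketch of one level of the orthonormal Haar wavelet transform (our own example; practical analyses usually use longer wavelet filters):

```c
#include <math.h>

/* One level of the orthonormal Haar DWT: split a length-n signal
   (n even) into n/2 approximation and n/2 detail coefficients. */
void haar_dwt_level(const double *x, int n, double *approx, double *detail)
{
    const double s = 1.0 / sqrt(2.0);
    for (int i = 0; i < n / 2; i++) {
        approx[i] = s * (x[2 * i] + x[2 * i + 1]);  /* lowpass, decimated  */
        detail[i] = s * (x[2 * i] - x[2 * i + 1]);  /* highpass, decimated */
    }
}
```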

 

Speech Recognition

- Feature processing: DCT; MFCC; Mel-frequency log energy coefficients (MFLEC); Subband (SB-MFCC); CMS; Within-Vector Filtered (WVF-MFCC); Robust Formant (RF) algorithm; Split Levinson Algorithm (SLA)
- Vector quantization: VQ correlation; Single VQ; Joint VQ
- Acoustic & language model: Viterbi decoding (sketch below); LDA; QDA; MLP; PLP; EM re-estimation; Minimum Bayes Risk (MBR); Maximum Likelihood Estimation (MLE); NN (Elman predictive); HMM (Baum-Welch); GMM; Buried Markov Model; Decision-tree state clustering; WFST; Dynamic Bayesian Networks
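Viterbi decoding recurs in nearly every acoustic-model column of this survey. A hedged, generic textbook sketch of the recursion over an HMM with precomputed log-probabilities (not any particular toolkit's decoder):

```c
#include <float.h>
#include <stdlib.h>

/* Viterbi decoding for an N-state HMM over T frames, in the log domain.
   log_init[i]       log P(state i at t = 0)
   log_trans[i*N+j]  log P(state j | state i)
   log_emit[t*N+j]   log P(observation t | state j)
   path[0..T-1]      output: most likely state sequence
   Returns the log-probability of the best path. */
double viterbi(int N, int T, const double *log_init,
               const double *log_trans, const double *log_emit, int *path)
{
    double *delta = malloc((size_t)T * N * sizeof *delta);
    int *psi = malloc((size_t)T * N * sizeof *psi);

    for (int j = 0; j < N; j++)
        delta[j] = log_init[j] + log_emit[j];

    for (int t = 1; t < T; t++)
        for (int j = 0; j < N; j++) {
            double best = -DBL_MAX;
            int arg = 0;
            for (int i = 0; i < N; i++) {
                double v = delta[(t - 1) * N + i] + log_trans[i * N + j];
                if (v > best) { best = v; arg = i; }
            }
            delta[t * N + j] = best + log_emit[t * N + j];
            psi[t * N + j] = arg;
        }

    double best = -DBL_MAX;                 /* best final state */
    for (int j = 0; j < N; j++)
        if (delta[(T - 1) * N + j] > best) {
            best = delta[(T - 1) * N + j];
            path[T - 1] = j;
        }
    for (int t = T - 1; t > 0; t--)         /* backtrack */
        path[t - 1] = psi[t * N + path[t]];

    free(delta);
    free(psi);
    return best;
}
```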

Speech Synthesis

- Feature processing: MFCC; Log Area Ratio (LAR); Bark frequency scale; FFT; Power spectrum; LPC; LSF; F0; Likelihood ratio; Residual LP; Mel Log Spectral Approximation (MLSA); MLSA filter; Covariance; Energy; Delta, Delta-Delta
- Acoustic & language model: Viterbi decoding; HMM (Baum-Welch); EM training; WFST; CART; Harmonic plus Noise Modeling (HNM)
- Distance: Euclidean; Kullback-Leibler (definition below); Mean Squared Log Spectral Distance (MS-LSD); Mahalanobis; Itakura-Saito; Symmetrized Itakura; RMS (root mean squared log spectral)
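Two of these distances have compact standard forms. For reference (textbook definitions, not from the slides): the Kullback-Leibler divergence between densities p and q, and the RMS log-spectral distance between power spectra S and Ŝ:

```latex
D_{KL}(p \,\|\, q) \;=\; \int p(x)\,\ln\frac{p(x)}{q(x)}\, dx

d_{RMS}(S,\hat S) \;=\; \sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}
    \bigl[\ln S(\omega) - \ln \hat S(\omega)\bigr]^{2}\, d\omega}
```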

New Speech Functionality

- Feature processing: Bark scale; Fundamental frequency; Likelihood ratio; Covariance (Cholesky); MLSA; CMS; SB-MFCC; WVF-MFCC; Robust formant algorithm; Split Levinson algorithm; LPCC; LFCC; RMS energy; Delta cepstral (Milner, high-order); Pseudo LAR; PLP
- Acoustic & language model: HMM (Baum-Welch); HNM; MLP; WFST; CART; LDA, NLDA, QDA; Minimum Bayes Risk (MBR); Maximum Likelihood Estimation; NN (Elman predictive); Discrete Wigner Distribution; Code Excited Linear Predictor
- Distance: Kullback-Leibler; Mean Squared Log Spectral Distance (MS-LSD); Itakura-Saito; Symmetrized Itakura; RMS

Summary

- Intel® IPP 3.0 now covers the most useful primitives for speech processing
- Speech-enabled applications still require more primitives
- Developers and researchers need more samples

Thank you!

Vitaly Horban, [email protected]