European SR Engine for Navi


Multimedia Department
K.Marasek
05.07.2005
Fundamentals of speech production and analysis
Speech production and analysis: Web tutorial
 Speech production
 Basic speech units: phoneme, syllable, word, phrase, sentence, speaking
turn
 phone: subphonetic units, diphone, triphone, syllable as recognition units
 types of sounds:
 manner and place (constriction of vocal tract) of articulation,
 vowels and consonants:
 sonorants (vowels, diphthongs, glides, liquids, nasals)
 obstruents (stops, fricatives, affricates)
 consonants classification depending on vocal tract configuration:
 labials, dentals, alveolars, palatals, glottals and pharyngeals
 transient sounds (diphthongs, glides, stops and affricates) and continuant sounds
 vowels: front, back, middle, low, high - vowel rectangle
 IPA chart
 coarticulation
 prosodic features: sentence intonation and word stress
 voice quality and paralinguistic features
 time and frequency features:
 formants and duration, wide- and narrow-band spectrograms
Principles of speech analysis
 Speech detection: remove silence and noise
 signal preprocessing and conditioning:
 pre-emphasis to enhance the speech signal at higher frequencies: $H(z) = 1 - a z^{-1}$, $a = 0.95$
 high-pass filtering
 spectral analysis
 short-time Fourier transform (STFT):
$$S(n, e^{j\omega}) = \sum_{m=-\infty}^{\infty} s(m)\, w(n-m)\, e^{-j\omega m}$$
 where $w(n-m)$ is a window sequence for observing the $n$-th time instant
 the window is usually tapered (e.g. a Hamming window) to limit the effects of multiplication in the time domain, which corresponds to convolution in the frequency domain: $x \cdot y \leftrightarrow X * Y$
 frequency and time resolution trade-off (FFT principle)
 a vector of coefficients as output; usually only the magnitude in log scale is considered: the power spectrum
 side effects: spectral leakage, picket-fence effect, etc.; biased estimator of the power spectral density (PSD)
 fixed-length frames do not fit the F0 period: fluctuations; remedy: pitch-synchronous analysis
 spectrograms reading: exercises
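The preprocessing and short-time analysis steps above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function names, the 64-sample frame, the 32-sample hop and the 1 kHz test tone are assumptions chosen for the example, and a naive DFT stands in for an FFT to keep the sketch dependency-free.

```python
import cmath
import math

def pre_emphasis(signal, a=0.95):
    """Apply H(z) = 1 - a z^-1, i.e. y(n) = s(n) - a s(n-1)."""
    return [signal[0]] + [signal[n] - a * signal[n - 1] for n in range(1, len(signal))]

def hamming(length):
    """Tapered (Hamming) window, used to reduce spectral leakage."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

def log_power_spectrum(frame):
    """Log power spectrum (in dB) of one Hamming-windowed frame, naive DFT."""
    n_fft = len(frame)
    x = [s * w for s, w in zip(frame, hamming(n_fft))]
    spectrum = []
    for k in range(n_fft // 2 + 1):
        value = sum(x[m] * cmath.exp(-2j * math.pi * k * m / n_fft) for m in range(n_fft))
        spectrum.append(10 * math.log10(abs(value) ** 2 + 1e-12))
    return spectrum

def stft(signal, frame_len=64, hop=32):
    """Short-time analysis: one log power spectrum per (overlapping) frame."""
    return [log_power_spectrum(signal[start:start + frame_len])
            for start in range(0, len(signal) - frame_len + 1, hop)]

# Hypothetical test signal: a 1 kHz tone sampled at 8 kHz.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
frames = stft(pre_emphasis(tone))
peak_bin = max(range(len(frames[0])), key=lambda k: frames[0][k])
peak_frequency = peak_bin * fs / 64
```

With a 64-sample frame at 8 kHz each bin is 125 Hz wide; a longer frame would sharpen the frequency resolution at the cost of time resolution, which is the trade-off mentioned above.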
Other methods of speech analysis
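 Time-frequency distributions (TFDs):
$$P_s(t,\omega) = \frac{1}{4\pi^2} \iiint e^{-j(\theta t + \tau\omega - \theta u)}\, \phi(\theta,\tau)\, s^*\!\left(u - \tfrac{\tau}{2}\right) s\!\left(u + \tfrac{\tau}{2}\right) du\, d\tau\, d\theta$$
 where $\phi(\theta,\tau)$ is the kernel function defining the smoothing properties of the TFDs: Wigner-Ville, Rihaczek, and others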
 the spectrogram is a special case of a TFD:
$$|S(t,\omega)|^2 = \left| \frac{1}{\sqrt{2\pi}} \int e^{-j\omega\tau}\, w(\tau - t)\, s(\tau)\, d\tau \right|^2$$
 TFDs have no trade-off between time and frequency resolution (limited only by Heisenberg's uncertainty principle and the sampling frequency), but they suffer from interference between signal components
 wavelet transform: future analysis tool? Non-uniform sampling of time-frequency
plane
 Filter bank analysis:
 the most specific cues of the signal are located in specific frequency bands
[Figure: Filter bank for telephony]
 FIR filters are preferable (linear phase) but can be very long; IIR filters are shorter; usually the filtering is carried out in the frequency domain
 powerful enough for small-vocabulary applications, e.g. 7 bands for a 60-word DTW recognizer
Wigner-Ville distribution
 /a/ vowel
Wigner-Ville distribution
 /t/ stop
 sig1=wavread('d:\pjwstk\charlotte\lectures\ata2.wav');
 plot(sig1);
 tfrwv(sig1);
[Figure: signal waveform and its Wigner-Ville distribution (WV, log. scale, imagesc, Threshold=0.1%); axes: Time [s], Frequency [Hz]]
Linear Predictive Coding (LPC)
 Wiener (1966), Markel and Gray (1976), Makhoul (1973)
 ARMA model of a process:
$$H(z) = \frac{\sum_{k=0}^{q} b_k z^{-k}}{1 - \sum_{k=1}^{p} a_k z^{-k}}, \qquad s(n) = \sum_{k=1}^{p} a_k s(n-k) + \sum_{k=0}^{q} b_k u(n-k)$$
 where p and q are model orders of pole and zero filters, and a and b represent sets of
coefficients
 LPC = AR model; to compute the coefficients it is necessary to define the prediction error, the so-called residual signal:
$$e(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k)$$
 the coefficients of the filter can then be computed by applying the least-squares criterion to minimize the total squared error $E = \sum_n e^2(n)$
 once the predictor coefficients have been estimated, the e(n) signal can be used for a
perfect signal reconstruction
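The residual and the reconstruction claim can be illustrated with a short sketch. It assumes the sign convention $e(n) = s(n) - \sum_k a_k s(n-k)$ and treats samples before $n = 0$ as zero; the function names, the coefficient values and the toy signal are hypothetical and not taken from the lecture.

```python
def lpc_residual(signal, coeffs):
    """e(n) = s(n) - sum_k a_k s(n-k); samples before n = 0 are taken as zero."""
    residual = []
    for n, sample in enumerate(signal):
        prediction = sum(a * signal[n - k]
                         for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(sample - prediction)
    return residual

def lpc_reconstruct(residual, coeffs):
    """s(n) = e(n) + sum_k a_k s(n-k): the all-pole synthesis filter."""
    signal = []
    for n, error in enumerate(residual):
        prediction = sum(a * signal[n - k]
                         for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        signal.append(error + prediction)
    return signal

# Hypothetical second-order predictor and toy signal.
coeffs = [1.2, -0.5]
signal = [0.0, 1.0, 0.8, 0.3, -0.2, -0.6, -0.4, 0.1]
residual = lpc_residual(signal, coeffs)
restored = lpc_reconstruct(residual, coeffs)
total_squared_error = sum(e * e for e in residual)
```

Reconstruction is exact for any coefficient set as long as the unquantized residual is kept; the least-squares criterion only decides how small `total_squared_error` becomes.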
LPC
 speech synthesis application:
[Figure: LPC Synthesis]
 critical: model order, quantization of parameters and excitation signal
 computation of coefficients: many methods, usually autocorrelation or auto-covariance
 features of LPC:
 modeling of peaks of the spectrum: good for formant frequency and bandwidth estimation
 smoothed spectrum - spectral envelope
 acoustic model of a tube with p/2 cylindrical sections
 model order, rule of thumb: sampling frequency in kHz + 2
 SVD for model order estimation
 application in speech recognition: signal parametrization, but not commonly used
 RASTA filtering for noisy signals
 exercises: LPC analysis using Praat
LPC-based coefficients
 Usually the LPC coefficients themselves are not used, but rather parameters derived from them
 reflection coefficients: directly obtainable during LPC computations (Levinson-Durbin
recursion)
 $E^{(i)}$ is the total prediction error at the $i$-th recursion step and $a_l^{(i)}$ is the $l$-th coefficient. Let $E^{(0)} = R(0)$, where $R(i)$ is the $i$-th autocorrelation coefficient; then recursively for $i = 1 \dots p$:
$$k_i = \frac{R(i) - \sum_{l=1}^{i-1} a_l^{(i-1)} R(i-l)}{E^{(i-1)}}$$
$$a_i^{(i)} = k_i$$
$$a_l^{(i)} = a_l^{(i-1)} - k_i\, a_{i-l}^{(i-1)}, \quad 1 \le l \le i-1$$
$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)}$$
 where $k_i$ denotes the reflection coefficient (PARCOR), $|k_i| < 1$
 acoustic tube model: let $A_i$ be the cross-section of the $i$-th segment; then for neighboring sections the following holds:
$$g_i = \frac{A_{i+1}}{A_i} = \frac{1 - k_i}{1 + k_i}, \quad 1 \le i \le p$$
 line spectral frequencies (LSFs): related to the poles of the AR filter; a concentration of two or more LSFs in a narrow frequency interval indicates the presence of a resonance in the LPC spectrum
 LPC cepstral coefficients: $c_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}$; Mel-based variants are possible
 perceptual LPC (PLP, Hermansky), using hearing properties, effective for noisy data
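The Levinson-Durbin recursion and two of the derived parameter sets can be sketched as below. This is an illustration under the sign convention $e(n) = s(n) - \sum_k a_k s(n-k)$; the function names are hypothetical, and the example autocorrelation $R(i) = 0.5^i$ (a first-order AR process) is an assumption chosen because its solution is known in closed form.

```python
def levinson_durbin(r, order):
    """Predictor coefficients from autocorrelation values r[0..order].

    Returns (a, k, error): a_1..a_p, reflection coefficients k_1..k_p,
    and the final prediction error E^(p).
    """
    a = [0.0] * (order + 1)          # a[l] holds a_l; a[0] is unused
    k = [0.0] * (order + 1)
    error = r[0]                     # E^(0) = R(0)
    for i in range(1, order + 1):
        acc = r[i] - sum(a[l] * r[i - l] for l in range(1, i))
        k[i] = acc / error
        new_a = a[:]
        new_a[i] = k[i]
        for l in range(1, i):
            new_a[l] = a[l] - k[i] * a[i - l]
        a = new_a
        error *= 1.0 - k[i] ** 2     # E^(i) = (1 - k_i^2) E^(i-1)
    return a[1:], k[1:], error

def lpc_cepstrum(a, n_ceps):
    """c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, with a_n = 0 for n > p."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        c[n] = a_n + sum((j / n) * c[j] * a[n - j - 1]
                         for j in range(1, n) if 1 <= n - j <= p)
    return c[1:]

def area_ratios(k):
    """Ratios A_{i+1}/A_i = (1 - k_i)/(1 + k_i) of the acoustic tube model."""
    return [(1.0 - ki) / (1.0 + ki) for ki in k]

# Hypothetical input: autocorrelation of an AR(1) process with coefficient 0.5.
r = [1.0, 0.5, 0.25]
a, k, final_error = levinson_durbin(r, order=2)
cepstrum = lpc_cepstrum(a, n_ceps=3)
ratios = area_ratios(k)
```

For this input the recursion should recover a_1 = 0.5 and a_2 = 0, which makes it easy to check the implementation by hand.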
LPC
[Figure: Vowel LPC spectrum for various model orders]
Homomorphic cepstral analysis
 Signal decomposition into components having different spectral characteristics
 the objective is to decompose a given signal s(n) into source e(n) and vocal tract h(n) components: s(n) = e(n) * h(n) (* denotes convolution), which in the frequency domain corresponds to
$$S(e^{j\omega_k}) = E(e^{j\omega_k})\, H(e^{j\omega_k})$$
 taking the logarithm one gets: $\log|S(e^{j\omega_k})| = \log|E(e^{j\omega_k})| + \log|H(e^{j\omega_k})|$
 the frequency response of the vocal tract log(|H|) is a slowly varying component and represents the envelope of log(|S|), while log(|E|) is a rapidly varying excitation component
 the components can be separated in the log-spectral domain by computing the IFFT and retaining the lowest-order coefficients to account for the vocal-tract transfer function
 the inverse Fourier transform of log(|S|) is called the cepstrum (real cepstrum; a complex cepstrum also exists)
[Figure: Block diagram of homomorphic analysis]
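The homomorphic chain (DFT, log magnitude, inverse DFT, retain the low-order coefficients) can be sketched as follows. The function names, the 32-sample toy frame and the cut-off of 8 coefficients are assumptions for illustration; a naive DFT is used instead of an FFT so that the sketch needs only the standard library.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]

def idft_real(spectrum):
    """Real part of the inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * m / n) for k in range(n)).real / n
            for m in range(n)]

def real_cepstrum(frame):
    """Inverse DFT of log|S|; a small floor avoids log(0)."""
    log_magnitude = [math.log(abs(v) + 1e-12) for v in dft(frame)]
    return idft_real(log_magnitude)

def smoothed_log_spectrum(cepstrum, n_keep):
    """Keep coefficients 0..n_keep (and their mirror images), zero the rest,
    and transform back: an estimate of the slowly varying envelope log|H|."""
    n = len(cepstrum)
    liftered = [c if (q <= n_keep or q >= n - n_keep) else 0.0
                for q, c in enumerate(cepstrum)]
    return [v.real for v in dft(liftered)]

# Hypothetical toy frame (not speech): two superimposed sinusoids.
frame = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(32)]
cepstrum = real_cepstrum(frame)
envelope = smoothed_log_spectrum(cepstrum, n_keep=8)
```

Keeping all coefficients returns log|S| unchanged; lowering `n_keep` gives a progressively smoother envelope, which is the cepstrally smoothed spectrum.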
Cepstral Analysis and Auditory Models
 Cepstrally smoothed spectrum: examples
 widely used in pattern-matching problems, because Euclidean distance between two
cepstral vectors represents a good measure for comparing log-spectra
 Auditory Models
 separating the message from surrounding noise
 modeling of output from cochlea
 bark or mel scale of frequency axis: linear to ca. 1000 Hz, logarithmic above
 Acoustic features for SR
 static: short time interval (20-50 ms)
 dynamic: change of parameters
 these features define the front end of the recognizer
Filter bank based coefficients
 Reduce the dimensionality of spectral signal representation
 fundamental decisions: structure of the filter bank: number of filters, their response
and spacing in frequency
 symmetric triangular filter used to weight DFT values: “quick and dirty” approximation of
band-pass filtering
 Example of a filter bank (24 triangular filters) spaced according to Mel-scale
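 Mel-based cepstral coefficients (MFCC), the most popular in ASR: usually computed as the IFFT of the log-energy outputs of a filter bank consisting of $I$ triangular filter masks, which for this real, symmetric input reduces to a cosine transform:
$$c_n = \sum_{i=1}^{I} \log|S'(i)| \cos\left[n\left(i - \tfrac{1}{2}\right)\frac{\pi}{I}\right], \quad 1 \le n \le M$$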
 $c_0$ approximates the log-energy of the signal; higher-order coefficients represent log-energy ratios between bands (e.g. $c_1$ gives the log-energy ratio between the intervals $[0, F_s/4]$ and $[F_s/4, F_s/2]$: higher for sonorants, lower for fricatives), but for higher-order coefficients the interpretation is complicated
 the IFFT is an orthogonal transform and the resulting coefficients are approximately uncorrelated -> simplified acoustic models can be used
 MFCC speech reconstruction (IBM, ICASSP-2000)
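The cosine-transform formula above can be sketched directly, together with mel-spaced filter edges. The mel mapping $2595 \log_{10}(1 + f/700)$ is one commonly used formula and is an assumption here (the lecture does not state which one it uses); the function names, the 16 kHz sampling rate and the flat test input are likewise hypothetical.

```python
import math

def hz_to_mel(f):
    """Assumed mel mapping: roughly linear below ~1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_filters, fs):
    """n_filters + 2 frequencies equally spaced on the mel axis; triangular
    filter i spans edges[i-1]..edges[i+1] with its peak at edges[i]."""
    top = hz_to_mel(fs / 2.0)
    return [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]

def cepstral_coefficients(log_energies, n_ceps):
    """c_n = sum_i log|S'(i)| cos[n (i - 1/2) pi / I] for n = 0..n_ceps."""
    n_filters = len(log_energies)
    return [sum(e * math.cos(n * (i - 0.5) * math.pi / n_filters)
                for i, e in enumerate(log_energies, start=1))
            for n in range(n_ceps + 1)]

edges = mel_filter_edges(n_filters=24, fs=16000)
# Hypothetical flat log-energy spectrum: only c_0 should be non-zero.
flat = [2.0] * 24
mfcc = cepstral_coefficients(flat, n_ceps=12)
```

A flat filter-bank output carries no spectral shape, so all coefficients except c_0 vanish; any deviation from flatness shows up in the higher-order terms.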
Fundamental Frequency and Formants
 F0 estimation (Hess): determining the main period in a quasi-periodic waveform
 usually using the autocorrelation function (ACF) and the average magnitude difference function (AMDF):
$$\mathrm{AMDF}_t(m) = \frac{1}{N_p} \sum_{n} |s_t(n) - s_t(n+m)|, \quad 0 \le n + m \le L - 1$$
where $L$ is the frame length and $N_p$ is the number of point pairs (a peak in the ACF and a valley in the AMDF indicate F0)
 usually speech signal is first low-pass filtered to avoid influence of formants
 cepstral analysis: peak at T0
 Formant frequency estimation:
 resonances in the vocal tract are related to the complex poles of the LPC model, $z_k = \mathrm{Re}(z_k) + j\,\mathrm{Im}(z_k)$:
$$F_k = \frac{F_s}{2\pi} \arctan\left(\frac{\mathrm{Im}(z_k)}{\mathrm{Re}(z_k)}\right), \qquad B_k = -\frac{F_s}{\pi} \log(|z_k|)$$
 cepstral smoothed spectrum also used
 a lot of methods, but..
 tracking of formant frequencies is a problem not solved yet
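Both estimators can be sketched in a few lines. This is an illustration only: the function names, the 60-400 Hz search range, the synthetic two-harmonic frame and the example pole are assumptions, and no low-pass filtering or voicing decision is included.

```python
import cmath
import math

def autocorrelation(frame, lag):
    """Short-time autocorrelation at a given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

def amdf(frame, lag):
    """AMDF(m) = (1/N_p) sum_n |s(n) - s(n+m)| over the N_p available point pairs."""
    n_pairs = len(frame) - lag
    return sum(abs(frame[n] - frame[n + lag]) for n in range(n_pairs)) / n_pairs

def estimate_f0(frame, fs, f_min=60.0, f_max=400.0):
    """F0 from the ACF peak and from the AMDF valley within the lag search range."""
    lags = range(int(fs / f_max), int(fs / f_min) + 1)
    acf_lag = max(lags, key=lambda m: autocorrelation(frame, m))
    amdf_lag = min(lags, key=lambda m: amdf(frame, m))
    return fs / acf_lag, fs / amdf_lag

def formant_from_pole(z, fs):
    """Frequency and bandwidth of the resonance given by a complex LPC pole z.
    atan2 is used instead of arctan(Im/Re) so that poles with Re(z) < 0 work."""
    frequency = fs / (2 * math.pi) * math.atan2(z.imag, z.real)
    bandwidth = -fs / math.pi * math.log(abs(z))
    return frequency, bandwidth

# Hypothetical voiced frame: 100 Hz fundamental plus second harmonic at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 100 * n / fs) + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
         for n in range(400)]
f0_acf, f0_amdf = estimate_f0(frame, fs)

# Hypothetical pole with radius 0.95 at the angle corresponding to 500 Hz.
pole = 0.95 * cmath.exp(2j * math.pi * 500 / fs)
formant_frequency, formant_bandwidth = formant_from_pole(pole, fs)
```

The closer a pole lies to the unit circle, the narrower the bandwidth; a pole radius of 0.95 at 8 kHz corresponds to a bandwidth of roughly 130 Hz.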
Dynamic features
 Temporal variation and contextual dependency
 time derivative features
 not sensitive to slow channel-dependent variations of static parameters
 the first-order difference is affected by various types of noise, thus smoothing is necessary
 polynomial expansion of time derivatives (Furui)
 second-order derivatives (acceleration) are also often used
 Typical set of parameters: E, 12 MFCC, ΔE, ΔMFCC, ΔΔE, ΔΔMFCC: the observation vector consists of 39 parameters
 Other types of dynamic features:
 spectral variation function
 dynamic cepstrum
 Karhunen-Loeve Transformation (KLT): segmenting speech into subword units depending only
on acoustic properties without a priori defined units, like phonemes
 RASTA processing - band-pass filtering
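How the 39-dimensional observation vector arises from 13 static parameters can be sketched with a regression-based derivative. The regression formula over a window of ±2 frames with edge frames repeated is one common smoothed estimate and is an assumption here, as are the function names and the synthetic ramp input.

```python
def delta(features, window=2):
    """Smoothed time derivative of each parameter:
    d_t = sum_k k (c_{t+k} - c_{t-k}) / (2 sum_k k^2), k = 1..window.
    Frames beyond the edges are replaced by the first or last frame."""
    n_frames = len(features)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    result = []
    for t in range(n_frames):
        row = []
        for d in range(len(features[t])):
            num = sum(k * (features[min(t + k, n_frames - 1)][d]
                           - features[max(t - k, 0)][d])
                      for k in range(1, window + 1))
            row.append(num / denom)
        result.append(row)
    return result

def observation_vectors(static):
    """Concatenate static, first-order and second-order (acceleration) features."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Hypothetical input: 10 frames of 13 static parameters (E + 12 MFCC),
# each rising linearly by 1 per frame.
static = [[float(t)] * 13 for t in range(10)]
vectors = observation_vectors(static)

# A constant (channel-like) offset on the static parameters leaves the deltas unchanged.
shifted = [[v + 3.0 for v in frame] for frame in static]
same_deltas = delta(shifted) == delta(static)
```

For the linear ramp the first-order features in the middle of the sequence equal the slope and the second-order features are zero, which gives a quick sanity check.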