European SR Engine for Navi
Multimedia Department
K.Marasek
05.07.2005
Fundamentals of speech production and
analysis
Speech production and analysis: Web tutorium
Speech production
Basic speech units: phoneme, syllable, word, phrase, sentence, speaking turn
phone: subphonetic units, diphone, triphone, syllable as recognition units
types of sounds:
manner and place (constriction of vocal tract) of articulation,
vowels and consonants:
sonorants (vowels, diphthongs, glides, liquids, nasals)
obstruents (stops, fricatives, affricates)
consonants classification depending on vocal tract configuration:
labials, dentals, alveolars, palatals, glottals and pharyngeals
transient sounds (diphthongs, glides, stops and affricates) and continuant sounds
vowels: front, back, middle, low, high (vowel quadrilateral)
IPA chart
coarticulation
prosodic features: sentence intonation and word stress
voice quality and paralinguistic features
time and frequency features:
formants and duration, wide- and narrow-band spectrograms
Principles of speech analysis
Speech detection: remove silence and noise
signal preprocessing and conditioning:
pre-emphasis to enhance the speech signal at higher frequencies: $H(z) = 1 - a z^{-1}$, $a = 0.95$
high-pass filtering
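The pre-emphasis step can be illustrated with a minimal sketch in plain Python (no third-party libraries). The function name `pre_emphasis` and the handling of the first sample (treated as if s(-1) = 0) are assumptions made for this example, not something the slides specify.

```python
def pre_emphasis(signal, a=0.95):
    """Apply H(z) = 1 - a z^-1, i.e. y(n) = s(n) - a * s(n-1)."""
    if not signal:
        return []
    # Assumed boundary convention: s(-1) = 0, so the first sample passes through.
    out = [float(signal[0])]
    for n in range(1, len(signal)):
        out.append(signal[n] - a * signal[n - 1])
    return out


# A constant (0 Hz) input is attenuated to 1 - a of its level,
# while an alternating (Nyquist-rate) input is boosted to 1 + a.
dc = pre_emphasis([1.0] * 5)
nyquist = pre_emphasis([1.0, -1.0, 1.0, -1.0, 1.0])
```

The two test signals show the high-pass character of the filter: low frequencies are suppressed and high frequencies are emphasized.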
spectral analysis
short-time Fourier transform (STFT):
$$S(n, e^{j\omega}) = \sum_{m=-\infty}^{\infty} s(m)\, w(n-m)\, e^{-j\omega m}$$
where w(n-m) is a window sequence for observing the n-th time instant
the window is usually tapered (e.g. a Hamming window) to limit the side effects of multiplication in the time domain, which corresponds to convolution in the frequency domain: $x \cdot y \leftrightarrow X * Y$
frequency and time resolution trade-off (FFT principle)
a vector of coefficients as output; usually only the magnitude on a log scale is considered: the power spectrum
side effects: spectral leakage, picket-fence effect, etc.; biased estimator of the power spectral density (PSD)
fixed-length frames do not fit the F0 period: fluctuations; remedy: pitch-synchronous analysis
spectrogram reading: exercises
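The short-time analysis described on this slide can be sketched as follows, in plain Python. The frame length, hop size, the naive O(N^2) DFT and all function names are assumptions chosen to keep the example self-contained; a real front end would use an FFT routine.

```python
import cmath
import math


def hamming(length):
    """Hamming window w(k) = 0.54 - 0.46 cos(2 pi k / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (length - 1))
            for k in range(length)]


def dft(frame):
    """Naive DFT, used here only to avoid external dependencies."""
    n_points = len(frame)
    return [sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n_points)
                for m in range(n_points))
            for k in range(n_points)]


def stft_log_power(signal, frame_len=64, hop=32):
    """Return one log-power spectrum (bins 0..frame_len/2) per frame."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        windowed = [signal[start + k] * window[k] for k in range(frame_len)]
        spectrum = dft(windowed)[: frame_len // 2 + 1]
        # A small floor avoids log(0) for empty bins.
        frames.append([10 * math.log10(abs(c) ** 2 + 1e-12) for c in spectrum])
    return frames


fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(256)]
spec = stft_log_power(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

With these settings the bin spacing is fs / frame_len = 125 Hz, so a longer frame would give finer frequency resolution at the cost of coarser time resolution, which is the trade-off named above.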
Other methods of speech analysis
Time-frequency distributions:
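$$P_s(t, \omega) = \frac{1}{4\pi^2} \iiint e^{-j(\theta t + \tau\omega - \theta u)}\, \phi(\theta, \tau)\, s^*\!\left(u - \frac{\tau}{2}\right) s\!\left(u + \frac{\tau}{2}\right) du\, d\tau\, d\theta$$
where $\phi(\theta, \tau)$ is the kernel function defining the smoothing properties of the TFD: Wigner-Ville, Rihaczek, and others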
spectrogram is a special case of a TFD:
$$|S(t, \omega)|^2 = \left| \frac{1}{\sqrt{2\pi}} \int e^{-j\omega\tau}\, w(\tau - t)\, s(\tau)\, d\tau \right|^2$$
no trade-off between time and frequency resolution: limited only by Heisenberg's uncertainty principle (and the sampling frequency), but at the cost of interference between signal components
wavelet transform: future analysis tool? Non-uniform sampling of the time-frequency plane
Filter bank analysis:
the most specific cues of the signal are located in specific frequency bands
Filter bank for telephony
FIR filters are better (linear phase) but can be very long; IIR filters are shorter; usually filtering in the frequency domain is used
powerful enough for small-vocabulary applications, e.g. 7 bands for a 60-word DTW recognizer
Wigner-Ville distribution
/a/ vowel
Wigner-Ville distribution
/t/ stop
[Figure: signal waveform and Wigner-Ville distribution of the /t/ stop; WV, log. scale, imagesc, Threshold=0.1%; axes: Time [s], Frequency [Hz]]
sig1=wavread('d:\pjwstk\charlotte\lectures\ata2.wav');
plot(sig1);
tfrwv(sig1);
Linear Predictive Coding (LPC)
Wiener (1966), Markel and Gray (1976), Makhoul (1973)
ARMA model of a process:
$$H(z) = \frac{\sum_{k=0}^{q} b_k z^{-k}}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$
$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + \sum_{k=0}^{q} b_k\, u(n-k)$$
where p and q are the model orders of the pole and zero filters, and $a_k$ and $b_k$ are the corresponding sets of coefficients
LPC = AR (all-pole) model; to compute the coefficients it is necessary to define the prediction error, the so-called residual signal:
$$e(n) = s(n) - \sum_{k=1}^{p} a_k\, s(n-k)$$
the coefficients of the filter can then be computed by applying the least-squares criterion to minimize the total squared error
$$E = \sum_{n} e^2(n)$$
once the predictor coefficients have been estimated, the e(n) signal can be used for a
perfect signal reconstruction
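The claim that the residual allows perfect reconstruction can be checked with a small sketch in plain Python. The predictor coefficients and the test signal below are hypothetical values picked for illustration, and samples before n = 0 are assumed to be zero.

```python
def lpc_residual(signal, a):
    """e(n) = s(n) - sum_{k=1..p} a[k-1] * s(n-k), with s(n) = 0 for n < 0."""
    p = len(a)
    residual = []
    for n in range(len(signal)):
        prediction = sum(a[k - 1] * signal[n - k]
                         for k in range(1, p + 1) if n - k >= 0)
        residual.append(signal[n] - prediction)
    return residual


def lpc_synthesize(residual, a):
    """Invert the analysis filter: s(n) = e(n) + sum_k a[k-1] * s(n-k)."""
    p = len(a)
    signal = []
    for n in range(len(residual)):
        prediction = sum(a[k - 1] * signal[n - k]
                         for k in range(1, p + 1) if n - k >= 0)
        signal.append(residual[n] + prediction)
    return signal


a = [1.3, -0.6]  # hypothetical 2nd-order predictor (stable: |pole|^2 = 0.6)
s = [0.0, 1.0, 0.5, -0.3, -0.8, 0.1, 0.4]
e = lpc_residual(s, a)
s_rec = lpc_synthesize(e, a)
total_squared_error = sum(x * x for x in e)
```

Reconstruction is exact only when the unquantized residual is kept; in coding applications the residual is replaced by a cheaper excitation model, which is where the loss comes from.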
LPC
speech synthesis application:
LPC Synthesis
critical: model order, quantization of parameters and excitation signal
computation of coefficients: many methods, usually autocorrelation or auto-covariance
features of LPC:
modeling of peaks of the spectrum: good for formant frequency and bandwidth estimation
smoothed spectrum - spectral envelope
acoustic model of a tube with p/2 cylindrical sections
model order: rule of thumb: sampling frequency in kHz + 2
SVD for model order estimation
application in speech recognition: signal parametrization, but not commonly used
RASTA filtering for noisy signals
exercises: LPC analysis using Praat
LPC-based coefficients
Usually the LPC coefficients themselves are not used, but rather parameters derived from them
reflection coefficients: directly obtainable during LPC computation (Levinson-Durbin recursion)
$E^{(i)}$ is the total prediction error at the i-th recursion step and $a_l^{(i)}$ is the l-th coefficient at that step. Let $E^{(0)} = R(0)$, where R(i) is the i-th autocorrelation coefficient; then recursively for i = 1…p:
$$k_i = \frac{R(i) - \sum_{l=1}^{i-1} a_l^{(i-1)} R(i-l)}{E^{(i-1)}}$$
$$a_i^{(i)} = k_i$$
$$a_l^{(i)} = a_l^{(i-1)} - k_i\, a_{i-l}^{(i-1)}, \quad 1 \le l \le i-1$$
$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)}$$
where $k_i$ denotes the i-th reflection coefficient (PARCOR), $|k_i| < 1$
acoustic tube model: let $A_i$ be the cross-section of the i-th segment; then for neighboring sections the following holds:
$$g_i = \frac{A_{i+1}}{A_i} = \frac{1 - k_i}{1 + k_i}, \quad 1 \le i \le p$$
line spectral frequencies (LSF): related to the poles of the AR filter; a concentration of two or more LSFs in a narrow frequency interval indicates the presence of a resonance in the LPC spectrum
LPC cepstral coefficients ($c_n = a_n + \sum_{k=1}^{n-1} \frac{k}{n}\, c_k\, a_{n-k}$), Mel-based variants possible
perceptual linear prediction (PLP, Hermansky), using hearing properties, effective for noisy data
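The Levinson-Durbin recursion and the LPC cepstrum recursion from this slide can be written out as a short sketch in plain Python. It assumes the predictor convention s(n) ≈ Σ a_k s(n-k) and a_n = 0 for n > p; the function names and the toy autocorrelation sequence R(i) = 0.5^i are illustrative choices only.

```python
def autocorrelation(signal, max_lag):
    """R(i) = sum_n s(n) s(n+i) for i = 0..max_lag."""
    n_samples = len(signal)
    return [sum(signal[n] * signal[n + i] for n in range(n_samples - i))
            for i in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Return (a, k, err): predictor coefficients a_1..a_p, reflection
    coefficients k_1..k_p and the final prediction error E(p)."""
    a = [0.0] * (p + 1)  # index 0 is unused
    k = [0.0] * (p + 1)
    err = r[0]           # E(0) = R(0)
    for i in range(1, p + 1):
        acc = r[i] - sum(a[l] * r[i - l] for l in range(1, i))
        k[i] = acc / err
        new_a = a[:]
        new_a[i] = k[i]
        for l in range(1, i):
            new_a[l] = a[l] - k[i] * a[i - l]
        a = new_a
        err *= 1.0 - k[i] ** 2
    return a[1:], k[1:], err


def lpc_cepstrum(a, n_ceps):
    """c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}, with a_n = 0 for n > p."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # index 0 is unused
    for n in range(1, n_ceps + 1):
        a_n = a[n - 1] if n <= p else 0.0
        c[n] = a_n + sum((j / n) * c[j] * a[n - j - 1]
                         for j in range(1, n) if n - j <= p)
    return c[1:]


# Toy autocorrelation of a first-order process, R(i) = 0.5 ** i.
a, k, err = levinson_durbin([1.0, 0.5, 0.25], 2)
c = lpc_cepstrum(a, 3)
```

For this toy input the second reflection coefficient comes out as zero, which is the recursion's way of saying that a first-order model already explains the autocorrelation sequence.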
LPC
Vowel LPC spectrum for various model orders
Homomorphic cepstral analysis
Signal decomposition into components having different spectral characteristics
the objective is to decompose a given signal s(n) into source e(n) and vocal-tract h(n) components: s(n) = e(n) * h(n) (* denotes convolution), which in the frequency domain equals
$$S(e^{j\omega_k}) = E(e^{j\omega_k})\, H(e^{j\omega_k})$$
taking the log one gets:
$$\log|S(e^{j\omega_k})| = \log|E(e^{j\omega_k})| + \log|H(e^{j\omega_k})|$$
the frequency response of the vocal tract log(|H|) is a slowly varying component and represents the envelope of log(|S|), while log(|E|) is the rapidly varying excitation component
the components can be separated in the log-spectral domain by computing the IFFT and retaining the lowest-order coefficients to account for the vocal-tract transfer function
the inverse Fourier transform of log(|S|) is called the cepstrum (real cepstrum; a complex cepstrum also exists)
Block diagram of homomorphic analysis
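The homomorphic chain (DFT, log magnitude, inverse DFT, low-order liftering) might be sketched as follows in plain Python. The naive DFT, the small floor inside the logarithm, the function names and the short test frame are all assumptions made for the example.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT / inverse DFT; an FFT would be used in practice."""
    n_points = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n_points)
               for m in range(n_points))
           for k in range(n_points)]
    return [v / n_points for v in out] if inverse else out


def real_cepstrum(frame):
    """Inverse DFT of log|S(k)| (the floor 1e-12 avoids log(0))."""
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]


def smoothed_log_spectrum(frame, n_keep):
    """Keep cepstral coefficients 0..n_keep and their mirror images,
    zero the rest, and transform back to the log-spectral domain."""
    c = real_cepstrum(frame)
    n_points = len(c)
    liftered = [c[q] if q <= n_keep or q >= n_points - n_keep else 0.0
                for q in range(n_points)]
    return [v.real for v in dft(liftered)]


frame = [1.0, 0.5, 0.25, 0.125, 0.0, 0.0, 0.0, 0.0]
cepstrum = real_cepstrum(frame)
envelope = smoothed_log_spectrum(frame, 2)
```

Choosing `n_keep` is the practical design decision: too small and formant peaks are smeared, too large and the excitation ripple leaks back into the envelope.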
Cepstral Analysis and Auditory Models
Cepstrally smoothed spectrum: examples
widely used in pattern-matching problems, because Euclidean distance between two
cepstral vectors represents a good measure for comparing log-spectra
Auditory Models
separating the message from surrounding noise
modeling of the output from the cochlea
Bark or mel scale of the frequency axis: linear up to ca. 1000 Hz, logarithmic above
Acoustic features for SR
static: short time interval (20-50 ms)
dynamic: change of parameters
These features make up the front end of the recognizer
Filter bank based coefficients
Reduce the dimensionality of spectral signal representation
fundamental decisions: structure of the filter bank: number of filters, their response
and spacing in frequency
symmetric triangular filter used to weight DFT values: “quick and dirty” approximation of
band-pass filtering
Example of a filter bank (24 triangular filters) spaced according to Mel-scale
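A bank of triangular filters spaced on the mel scale could be built roughly as below, in plain Python. The mel formula used (2595 log10(1 + f/700)) is one common choice and an assumption here, since the slides do not give one; the FFT size, sampling rate and function names are likewise illustrative.

```python
import math


def hz_to_mel(f_hz):
    """One common mel formula (assumed; not specified on the slides)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, fs):
    """Return n_filters triangular weight vectors over DFT bins 0..n_fft/2."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(fs / 2.0)
    # n_filters + 2 edge frequencies, equally spaced on the mel axis.
    edges_hz = [mel_to_hz(top * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bank = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[i - 1], edges_hz[i], edges_hz[i + 1]
        weights = []
        for k in range(n_bins):
            f = k * fs / n_fft
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def filter_bank_energies(power_spectrum, bank):
    """Weight the power spectrum with each triangle and sum."""
    return [sum(w * p for w, p in zip(weights, power_spectrum))
            for weights in bank]


bank = mel_filter_bank(n_filters=24, n_fft=512, fs=16000)
energies = filter_bank_energies([1.0] * 257, bank)
```

This reduces a 257-point power spectrum to 24 band energies, which is the dimensionality reduction the slide refers to.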
Mel-based cepstral coefficients (MFCC), the most popular in ASR: usually computed as the IFFT (in practice a cosine transform) of the log-energy outputs of a filter bank consisting of I triangular filter masks:
$$c_n = \sum_{i=1}^{I} \log|S'(i)|\, \cos\!\left[n\left(i - \frac{1}{2}\right)\frac{\pi}{I}\right], \quad 1 \le n \le M$$
$c_0$ approximates the log-energy of the signal; higher-order coefficients represent log-energy ratios between bands (e.g. $c_1$ provides the log-energy ratio between the intervals [0, Fs/4] and [Fs/4, Fs/2]: higher for sonorants, lower for fricatives), but for higher-order coefficients the interpretation is complicated
the IFFT is an orthogonal transform, i.e. the coefficients are approximately uncorrelated -> simplified acoustic models can be used
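The cosine-transform step of the MFCC computation can be sketched directly from the formula, in plain Python. The function name and the two synthetic sets of filter-bank energies are assumptions for illustration; real input would be the outputs of a mel filter bank.

```python
import math


def mfcc_from_filter_bank(energies, n_ceps):
    """c_n = sum_{i=1}^{I} log(S'(i)) * cos(n * (i - 1/2) * pi / I), n = 1..n_ceps."""
    n_filters = len(energies)
    log_e = [math.log(e) for e in energies]  # energies must be positive
    return [sum(log_e[i - 1] * math.cos(n * (i - 0.5) * math.pi / n_filters)
                for i in range(1, n_filters + 1))
            for n in range(1, n_ceps + 1)]


# Flat band energies: no spectral shape, so all c_n (n >= 1) vanish.
flat = mfcc_from_filter_bank([2.0] * 24, 12)
# Energies that fall off with band index: more energy at low frequencies.
tilted = mfcc_from_filter_bank([math.exp(-0.1 * i) for i in range(24)], 12)
```

The tilted example yields a positive first coefficient, in line with the interpretation of c1 as a low-band versus high-band log-energy ratio.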
MFCC speech reconstruction (IBM, ICASSP-2000)
Fundamental Frequency and Formants
F0 estimation: (Hess) determining the main period in quasi-periodic waveform
usually using the autocorrelation function (ACF) and the average magnitude difference function (AMDF):
$$\mathrm{AMDF}_t(m) = \frac{1}{N_p} \sum_{n} |s_t(n) - s_t(n+m)|, \quad 0 \le n,\; n+m \le L-1$$
where L is the frame length and $N_p$ is the number of point pairs
(a peak in the ACF and a valley in the AMDF indicate F0)
usually speech signal is first low-pass filtered to avoid influence of formants
cepstral analysis: peak at T0
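A minimal AMDF-based F0 estimator might look like the sketch below, in plain Python. The search range (60 to 400 Hz), the synthetic two-harmonic test frame and the function names are assumptions for the example; no low-pass filtering or voicing decision is included.

```python
import math


def amdf(frame, lag):
    """AMDF(m) = (1/N_p) * sum_n |s(n) - s(n+m)| over all pairs inside the frame."""
    n_pairs = len(frame) - lag
    return sum(abs(frame[n] - frame[n + lag]) for n in range(n_pairs)) / n_pairs


def acf(frame, lag):
    """Short-time autocorrelation at the given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Pick the lag with the deepest AMDF valley inside the search range."""
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda m: amdf(frame, m))
    return fs / best_lag


fs = 8000
frame = [math.sin(2 * math.pi * 100 * n / fs)
         + 0.5 * math.sin(2 * math.pi * 200 * n / fs)
         for n in range(400)]
f0 = estimate_f0(frame, fs)
```

The search range matters: if it admitted lags of two periods or more, those would give equally deep valleys, which is the familiar octave-error problem of such estimators.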
Formant frequency estimation:
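resonances in the vocal tract are related to the complex poles of the LPC model, $z_k = \mathrm{Re}(z_k) + j\,\mathrm{Im}(z_k)$:
$$F_k = \frac{F_s}{2\pi} \arctan\!\left(\frac{\mathrm{Im}(z_k)}{\mathrm{Re}(z_k)}\right), \qquad B_k = -\frac{F_s}{\pi} \log(|z_k|)$$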
a cepstrally smoothed spectrum is also used
many methods exist, but tracking of formant frequencies remains an unsolved problem
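The mapping from an LPC pole to a formant frequency and bandwidth can be tried out with the short sketch below, in plain Python. Finding the poles themselves (the roots of the LPC polynomial) is left out; the helper `formant_to_pole` is a hypothetical inverse used only to build test poles, and `atan2` is substituted for arctan(Im/Re) so that frequencies above Fs/4 come out correctly.

```python
import cmath
import math


def pole_to_formant(z, fs):
    """Return (F_k, B_k) in Hz for a complex pole z of the LPC model."""
    # atan2 keeps the correct quadrant for poles with a negative real part.
    freq = fs / (2.0 * math.pi) * math.atan2(z.imag, z.real)
    bandwidth = -fs / math.pi * math.log(abs(z))
    return freq, bandwidth


def formant_to_pole(freq, bandwidth, fs):
    """Inverse mapping, used here only to construct example poles."""
    radius = math.exp(-math.pi * bandwidth / fs)
    return cmath.rect(radius, 2.0 * math.pi * freq / fs)


fs = 8000
pole = formant_to_pole(500.0, 80.0, fs)
f1, b1 = pole_to_formant(pole, fs)
f_high, b_high = pole_to_formant(formant_to_pole(2500.0, 120.0, fs), fs)
```

Poles close to the unit circle give narrow bandwidths; in practice, candidates with very large bandwidths are usually discarded as not being true formants.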
Dynamic features
Temporal variation and contextual dependency
time derivative features
not sensitive to slow channel-dependent variations of static parameters
first order difference is affected by various types of noise, thus smoothing necessary
polynomial expansion of time derivatives (Furui)
second order derivatives: acceleration also often used
Typical set of parameters: E, 12 MFCC, ΔE, ΔMFCC, ΔΔE, ΔΔMFCC: the observation vector consists of 39 parameters
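How the 39-parameter observation vector is assembled can be sketched as follows, in plain Python. The regression window of ±2 frames, the edge handling (repeating the first and last frame) and the function names are assumptions; the smoothed difference is a common regression-style estimate rather than necessarily the exact formula behind the slides.

```python
def delta(features, width=2):
    """Smoothed time derivative over +/- width frames:
    d_t = sum_{j=1..width} j * (c_{t+j} - c_{t-j}) / (2 * sum_j j^2).
    Frames beyond the edges are replaced by the first/last frame."""
    n_frames = len(features)
    dim = len(features[0])
    norm = 2.0 * sum(j * j for j in range(1, width + 1))
    out = []
    for t in range(n_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for j in range(1, width + 1):
                ahead = features[min(t + j, n_frames - 1)][d]
                behind = features[max(t - j, 0)][d]
                acc += j * (ahead - behind)
            row.append(acc / norm)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Append first- and second-order differences to each static frame."""
    d1 = delta(static)
    d2 = delta(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# 13 static parameters per frame (E + 12 MFCC), here a synthetic linear ramp.
static = [[float(t + i) for i in range(13)] for t in range(10)]
observations = add_dynamic_features(static)
```

Averaging over several frames is what provides the smoothing mentioned above; a plain first-order difference corresponds to a window of one frame and is noisier.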
Other types of dynamic features:
spectral variation function
dynamic cepstrum
Karhunen-Loeve Transformation (KLT): segmenting speech into subword units depending only
on acoustic properties without a priori defined units, like phonemes
RASTA processing - band-pass filtering