A Spectral-Temporal Method for Pitch Tracking
Stephen A. Zahorian*, Princy Dikshit, Hongbing Hu*
Department of Electrical and Computer Engineering
Old Dominion University, Norfolk, VA 23529, USA.
* Currently at Binghamton University
09/17/2006
Outline
Introduction
Algorithm
Algorithm overview
The use of nonlinear processing
Pitch tracking from the spectrum
Experimental evaluation
Conclusion
Introduction
Pitch (the fundamental frequency) applications
Automatic speech recognition (ASR), speech synthesis,
speech articulation training aids, etc.
Pitch detection algorithms
“Robust and accurate fundamental frequency estimation
based on dominant harmonic components,” Nakatani et al.
=> High accuracy for noisy speech reported using the harmonic
dominance spectrum
“Yet another algorithm for pitch tracking (YAAPT),”
Zahorian et al.
=> Hybrid spectral-temporal processing for pitch tracking
Algorithm Overview
The Use of Nonlinear Processing
Restoration of missing fundamental in telephone speech
A periodic sound is characterized by the spectrum of its
harmonics
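A signal with the fundamental missing can be approximated as
\[ y(t) = b_1\cos(\omega t) + b_2\cos(2\omega t) + b_3\cos(3\omega t), \qquad b_1 \approx 0 \]
where the three terms are the fundamental, the 1st harmonic, and the 2nd harmonic
After squaring (with b_1 = 0) and applying trigonometric identities
\[ y^2(t) = \frac{b_2^2 + b_3^2}{2} + b_2 b_3\cos(\omega t) + \frac{b_2^2}{2}\cos(4\omega t) + b_2 b_3\cos(5\omega t) + \frac{b_3^2}{2}\cos(6\omega t) \]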
The fundamental reappears
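The effect can be checked numerically. The following is a minimal sketch, not the authors' implementation: the values of F0, FS, B2, B3 and the helper tone_amplitude are illustrative assumptions. It builds a signal that contains only the terms at twice and three times the fundamental, squares it, and measures the component at the fundamental before and after.

```python
import cmath
import math


def tone_amplitude(samples, freq_hz, sample_rate_hz):
    """Amplitude of the sinusoidal component at freq_hz (single-bin DFT)."""
    n = len(samples)
    acc = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * i / sample_rate_hz)
        for i, x in enumerate(samples)
    )
    return 2.0 * abs(acc) / n


# Illustrative values (assumptions, not taken from the slides).
F0 = 100.0          # fundamental frequency in Hz
FS = 8000.0         # sampling rate in Hz
B2, B3 = 1.0, 0.5   # amplitudes of the 1st and 2nd harmonics
N = 800             # 0.1 s, a whole number of periods of F0

# Signal with the fundamental missing: only the 2*F0 and 3*F0 terms.
y = [
    B2 * math.cos(2 * math.pi * 2 * F0 * i / FS)
    + B3 * math.cos(2 * math.pi * 3 * F0 * i / FS)
    for i in range(N)
]
y_squared = [v * v for v in y]

amp_before = tone_amplitude(y, F0, FS)
amp_after = tone_amplitude(y_squared, F0, FS)
print(f"amplitude at F0 before squaring: {amp_before:.3f}")
print(f"amplitude at F0 after squaring:  {amp_after:.3f}")
```

Under these assumptions the amplitude at F0 is essentially zero before squaring and equals B2 * B3 afterwards, which matches the cross term of the squared signal.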
Illustration of Nonlinear Processing
The telephone speech signal (top panel) and squared
telephone signal (bottom panel) for one frame
Illustration of Nonlinear Processing
The magnitude spectrum for the telephone signal (top panel) and
the nonlinearly processed signal (bottom panel)
Spectral Effects from Nonlinear Processing
The missing fundamental in the telephone speech (top panel)
is restored in the squared signal (bottom panel)
[Figure: "Spectrum of the telephone speech" (top panel) and "Spectrum of the nonlinear processed signal" (bottom panel); axes: frequency (Hz), 100 to 400, versus time (seconds), 18 to 23]
Pitch Tracking From the Spectrum
The pitch track from the spectrum refines the pitch
candidates estimated from the temporal method
To achieve a noise-robust pitch track from the
spectrum, an autocorrelation type of function is
proposed
Autocorrelation Type of Function
The function takes into account multiple harmonics
[Figure: the spectrum (frequency 0 to 1000 Hz), with windows of length WL at k, 2k, 3k and 4k whose contents are multiplied together, and the resulting autocorrelation type of function (frequency 0 to 400 Hz)]
Equation
\[ y(k) = \sum_{i=-WL/2}^{WL/2} \; \prod_{n=1}^{N+1} f(nk + i) \]
f(i): The spectrum
k: Frequency index, k_{F0\_min} \le k \le k_{F0\_max}
N: The number of harmonics (3)
WL: Window length (20 Hz)
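The function can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors' implementation: the name harmonic_correlation, the window length given in bins rather than Hz, the zero treatment of out-of-range bins, and the synthetic spectrum are all hypothetical choices made for the example.

```python
def harmonic_correlation(spectrum, k_min, k_max, n_harmonics=3, window_bins=4):
    """Evaluate y(k) = sum_i prod_n f(n*k + i) for k_min <= k <= k_max.

    spectrum    : magnitude spectrum f, indexed by frequency bin
    n_harmonics : N in the formula; the product runs over n = 1 .. N+1
    window_bins : WL expressed in bins (an assumption; the slides give 20 Hz)
    """
    half = window_bins // 2
    result = {}
    for k in range(k_min, k_max + 1):
        total = 0.0
        for i in range(-half, half + 1):
            product = 1.0
            for n in range(1, n_harmonics + 2):
                index = n * k + i
                # Assumption: bins outside the spectrum carry zero energy.
                inside = 0 <= index < len(spectrum)
                product *= spectrum[index] if inside else 0.0
            total += product
        result[k] = total
    return result


# Synthetic spectrum (hypothetical data): a weak floor plus harmonics of
# bin 20 whose amplitude decays as 1/h.
spectrum = [0.01] * 200
for h in range(1, 10):
    spectrum[20 * h] = 1.0 / h

y = harmonic_correlation(spectrum, k_min=10, k_max=45)
best_k = max(y, key=y.get)
print("strongest candidate at bin", best_k)
```

Because the product requires energy at k, 2k, 3k and 4k simultaneously, an isolated noise peak in the spectrum contributes little, while the true fundamental bin (20 in this synthetic case) stands out.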
Peaks in Autocorrelation Type of Function
[Figure: "Spectrum" (amplitude versus frequency, 0 to 1200 Hz) and "Peaks in autocorrelation type of function" (amplitude versus frequency, 0 to 450 Hz)]
A very prominent peak is observed in the proposed function
Candidate Insertion to Reduce Pitch Doubling/Halving
If all candidates are larger than a threshold (typically 150
Hz), an additional candidate is inserted at half the frequency
of the highest-ranking candidate
Similar logic is used to reduce pitch halving
[Figure: "Peaks in autocorrelation type of function" (amplitude versus frequency, 0 to 400 Hz), showing the highest peak P1 and the inserted candidate P2, where P2 (Hz) = P1 (Hz) / 2]
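The doubling rule can be written as a short function. This is a sketch under assumptions rather than the authors' code: the function name, the list representation (candidates in Hz, best-ranked first), and the default threshold argument are hypothetical. The slides do not spell out the corresponding rule for pitch halving, so only the doubling case is shown.

```python
def insert_half_candidate(candidates_hz, threshold_hz=150.0):
    """Return the candidate list, adding P1/2 when every candidate is high.

    candidates_hz : pitch candidates for one frame in Hz, best-ranked first
    threshold_hz  : the "typically 150 Hz" threshold from the slide
    """
    if candidates_hz and all(c > threshold_hz for c in candidates_hz):
        # P2 = P1 / 2, where P1 is the highest-ranking candidate.
        return list(candidates_hz) + [candidates_hz[0] / 2.0]
    return list(candidates_hz)


print(insert_half_candidate([240.0, 310.0]))
print(insert_half_candidate([120.0, 240.0]))
```

In the first call every candidate exceeds 150 Hz, so 120 Hz is appended; in the second call a candidate below the threshold already exists and the list is returned unchanged.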
Experimental Evaluation
Database
Keele pitch extraction database
5 male and 5 female speakers, about 35 seconds per speaker
High-quality speech and telephone speech
Additive Gaussian noise
Controls (reference pitch)
Control C1: supplied in Keele database
Control C2: computed from the laryngograph signal
with the proposed algorithm
Definition of Error Measures
Gross error
The percentage of frames in which the tracker's pitch estimate
deviates from the reference pitch (control) by more than a
threshold (typically 20%)
Only evaluated in the voiced sections of the reference
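As an illustration of the measure, the sketch below counts gross errors over the voiced frames of a reference track. It assumes, hypothetically, that both tracks are lists of per-frame values in Hz and that a reference value of 0 marks an unvoiced frame; the function name and the sample data are made up for the example.

```python
def gross_error_rate(estimates_hz, reference_hz, threshold=0.20):
    """Percentage of voiced reference frames with a gross pitch error.

    Assumes a reference value of 0 marks an unvoiced frame.
    """
    voiced = [(e, r) for e, r in zip(estimates_hz, reference_hz) if r > 0]
    if not voiced:
        return 0.0
    errors = sum(1 for e, r in voiced if abs(e - r) > threshold * r)
    return 100.0 * errors / len(voiced)


# Hypothetical per-frame data: one doubling error and one missed frame.
reference = [0.0, 100.0, 102.0, 104.0, 0.0, 110.0]
estimates = [95.0, 101.0, 204.0, 103.0, 0.0, 0.0]
print(gross_error_rate(estimates, reference))
```

Of the four voiced reference frames, two deviate by more than 20%, so the example gives a gross error of 50%. The estimate in the first (unvoiced) frame is ignored because the measure is evaluated only where the reference is voiced.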
Experiment 1 Results
Individual performance of the proposed algorithm
Method           Control  Studio,    Studio,         Telephone,  Telephone,
                          Clean (%)  5 dB Noise (%)  Clean (%)   5 dB Noise (%)
                 C1       4.26       7.62            8.14        17.85
YAAPT*           C1       1.59       1.99            2.69        4.48
Spectral method  C1       4.23       4.45            6.52        6.95
NCCF             C1       3.58       4.52            8.00        16.61
YAAPT
YAAPT*: Using control C1 for the spectral pitch track
NCCF: Normalized cross-correlation function, used as the temporal
method in YAAPT
Experiment 2 Results
The results of the new method with various error thresholds
Error      Control  Studio,    Studio,         Telephone,  Telephone,
Threshold           Clean (%)  5 dB Noise (%)  Clean (%)   5 dB Noise (%)
10%        C1       5.46       7.31            9.39        16.14
10%        C2       4.18       6.06            7.77        14.78
20%        C1       2.90       3.65            4.86        7.45
20%        C2       1.56       2.16            3.27        5.85
40%        C1       2.25       2.44            2.75        3.63
40%        C2       0.91       1.06            0.99        2.05
Comparisons
Method           Control  Studio,    Studio,         Telephone,    Telephone,
                          Clean (%)  5 dB Noise (%)  Clean (%)     5 dB Noise (%)
Proposed Method  C1       2.90       3.65            4.86 (4.52*)  7.45 (5.90*)
DASH             C1       2.81       2.32            3.73*         4.15*
REPS             C1       2.68       2.98            6.91*         8.49*
YIN              C1       2.57       7.22            7.55*         14.6*
DASH, REPS, YIN: results as reported in “Robust and
accurate fundamental frequency estimation ... ,” Nakatani et al.
*: Telephone speech simulated with the SRAEN filter
Conclusion
A new pitch-tracking algorithm has been developed
which combines multiple information sources to
enable accurate, robust F0 tracking
An analysis of errors indicates better performance,
for both high-quality and telephone speech, than
previously reported for pitch tracking
Acknowledgements
This work was partially supported by JWFC 900