A Spectral-Temporal Method for Pitch Tracking
Stephen A. Zahorian*, Princy Dikshit, Hongbing Hu*
Department of Electrical and Computer Engineering
Old Dominion University, Norfolk, VA 23529, USA.
* Currently at Binghamton University
09/17/2006
Outline
 Introduction
 Algorithm
    - Algorithm overview
    - The use of nonlinear processing
    - Pitch tracking from the spectrum
 Experimental evaluation
 Conclusion
Introduction
 Applications of pitch (the fundamental frequency)
    - Automatic speech recognition (ASR), speech synthesis, speech articulation training aids, etc.
 Pitch detection algorithms
    - “Robust and accurate fundamental frequency estimation based on dominant harmonic components,” Nakatani et al.
      => High accuracy reported for noisy speech using the harmonic dominance spectrum
    - “Yet another algorithm for pitch tracking (YAAPT),” Zahorian et al.
      => Hybrid spectral-temporal processing for pitch tracking
Algorithm Overview
The Use of Nonlinear Processing
 Restoration of missing fundamental in telephone speech
 A periodic sound is characterized by the spectrum of its
harmonics
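 A signal whose fundamental is missing can be approximated by setting b_1 = 0 in
$$y(t) = b_1\cos(\omega t) + b_2\cos(2\omega t) + b_3\cos(3\omega t)$$
where the three terms are the fundamental, the 1st harmonic, and the 2nd harmonic
After squaring and applying trigonometric identities (with b_1 = 0)
$$y^2(t) = \frac{b_2^2 + b_3^2}{2} + b_2 b_3\cos(\omega t) + \frac{b_2^2}{2}\cos(4\omega t) + b_2 b_3\cos(5\omega t) + \frac{b_3^2}{2}\cos(6\omega t)$$
The fundamental reappears as the b_2 b_3 cos(ωt) term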
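The effect of squaring can be checked numerically. The following is a minimal sketch, not the authors' implementation: the sample rate, the 100 Hz fundamental, the amplitudes, and the helper name `tone_amplitude` are all assumptions chosen for illustration. It builds a signal containing only the 1st and 2nd harmonics, squares it, and measures the component at the fundamental before and after.

```python
import math

def tone_amplitude(signal, freq_hz, sample_rate):
    """Amplitude of the sinusoid at freq_hz, via a single-bin DFT (hypothetical helper)."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n

# Assumed illustration values (not from the slides).
SAMPLE_RATE = 8000      # Hz
F0 = 100.0              # fundamental frequency in Hz
B2, B3 = 1.0, 0.5       # amplitudes of the 1st and 2nd harmonics
N = 800                 # 0.1 s, an integer number of periods of F0

# Signal with the fundamental missing: only 2*F0 and 3*F0 are present.
y = [B2 * math.cos(2 * math.pi * 2 * F0 * i / SAMPLE_RATE)
     + B3 * math.cos(2 * math.pi * 3 * F0 * i / SAMPLE_RATE)
     for i in range(N)]
y_squared = [v * v for v in y]

amp_before = tone_amplitude(y, F0, SAMPLE_RATE)         # close to 0
amp_after = tone_amplitude(y_squared, F0, SAMPLE_RATE)  # close to B2 * B3

print(f"fundamental before squaring: {amp_before:.6f}")
print(f"fundamental after squaring:  {amp_after:.6f}")
```

The amplitude recovered at the fundamental matches the b_2 b_3 coefficient of the cos(ωt) term in the squared signal.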
Illustration of Nonlinear Processing
 The telephone speech signal (top panel) and squared
telephone signal (bottom panel) for one frame
Illustration of Nonlinear Processing
 The magnitude spectrum of the telephone signal (top panel) and of the
nonlinearly processed signal (bottom panel)
Spectral Effects from Nonlinear Processing
 The missing fundamental in the telephone speech (top panel)
is restored in the squared signal (bottom panel)
[Figure: "Spectrum of the telephone speech" (top panel) and "Spectrum of the nonlinear processed signal" (bottom panel); axes: Frequency (Hz), 100 to 400, versus Time (Seconds), 18 to 23]
Pitch Tracking From the Spectrum
 The pitch track from the spectrum refines the pitch
candidates estimated from the temporal method
 To achieve a noise-robust pitch track from the
spectrum, an autocorrelation type of function is
proposed
Autocorrelation Type of Function
 The function takes into account multiple harmonics
[Figure: "Spectrum" panel with a window of width WL placed at k, 2k, 3k, and 4k, and "Autocorrelation type of function" panel; horizontal axes: Frequency (Hz)]
 Equation
$$y(k) = \sum_{i=-WL/2}^{WL/2} \; \prod_{n=1}^{N+1} f(nk + i)$$
f(i): the spectrum
k: frequency index, $k_{F0\_min} \le k \le k_{F0\_max}$
N: the number of harmonics (3)
WL: window length (20 Hz)
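The autocorrelation type of function can be sketched in a few lines. This is an illustrative reading of the slide's equation, not the authors' code: it assumes the product runs over n = 1 to N + 1 (matching the k, 2k, 3k, 4k markers for N = 3), it expresses the window in spectrum bins rather than Hz, and the function name and toy spectrum are hypothetical.

```python
def autocorrelation_type_function(spectrum, k_min, k_max,
                                  num_harmonics=3, window_bins=4):
    """Score each candidate frequency index k in [k_min, k_max].

    For every offset i inside the window, multiply the spectrum values at
    k + i, 2k + i, ..., (num_harmonics + 1) * k + i, then sum over i.
    Indices outside the spectrum contribute zero (an assumption).
    """
    half = window_bins // 2
    scores = {}
    for k in range(k_min, k_max + 1):
        total = 0.0
        for i in range(-half, half + 1):
            product = 1.0
            for n in range(1, num_harmonics + 2):
                idx = n * k + i
                product *= spectrum[idx] if 0 <= idx < len(spectrum) else 0.0
            total += product
        scores[k] = total
    return scores

# Toy magnitude spectrum: a small noise floor with harmonic peaks at
# bins 20, 40, 60, and 80, so the true fundamental is at bin 20.
toy_spectrum = [0.01] * 200
for harmonic_bin in (20, 40, 60, 80):
    toy_spectrum[harmonic_bin] = 1.0

scores = autocorrelation_type_function(toy_spectrum, k_min=10, k_max=40)
best_k = max(scores, key=scores.get)
print("best frequency index:", best_k)
```

Because the terms are multiplied rather than added, a candidate scores highly only when all of its harmonics are present, which is why half and double the true index score poorly in this toy case.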
Peaks in Autocorrelation Type of Function
[Figure: "Spectrum" (top panel) and "Peaks in autocorrelation type of function" (bottom panel); axes: Amplitude versus Frequency (Hz)]
A very prominent peak is observed in the proposed function
Candidate Insertion to Reduce Pitch Doubling/Halving
 If all candidates are larger than a threshold (typically 150
Hz), an additional candidate is inserted at half the frequency
of the highest-ranking candidate
 Similar logic is used to reduce pitch halving
[Figure: "Peaks in autocorrelation type of function", showing the highest-ranking candidate P1 and the inserted candidate P2 (Hz) = P1 (Hz) / 2; axes: Amplitude versus Frequency (Hz)]
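The candidate insertion rule for pitch doubling can be sketched as follows. This is a minimal illustration under stated assumptions: candidates are taken to be a list of frequencies in Hz ordered from highest to lowest rank, the inserted candidate is appended at the end of the list, and the function name is hypothetical. How the tracker later ranks the inserted candidate is not covered here.

```python
def insert_half_frequency_candidate(candidates_hz, threshold_hz=150.0):
    """Return a copy of the candidates, adding one at half the top candidate.

    candidates_hz is assumed to be ordered best first. The extra candidate
    is added only when every candidate exceeds threshold_hz.
    """
    result = list(candidates_hz)
    if result and all(c > threshold_hz for c in result):
        result.append(result[0] / 2.0)
    return result

# All candidates are above 150 Hz, so P2 = P1 / 2 is inserted.
with_insertion = insert_half_frequency_candidate([240.0, 360.0])

# One candidate is already below 150 Hz, so nothing is inserted.
without_insertion = insert_half_frequency_candidate([120.0, 240.0])

print(with_insertion)
print(without_insertion)
```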
Experimental Evaluation
 Database
    - Keele pitch extraction database
    - 5 male and 5 female speakers, about 35 seconds per speaker
    - High-quality speech and telephone speech
    - Additive Gaussian noise
 Controls (reference pitch)
    - Control C1: supplied in the Keele database
    - Control C2: computed from the laryngograph signal with the proposed algorithm
Definition of Error Measures
 Gross error
    - The percentage of frames in which the tracker's pitch estimate deviates significantly (typically by more than 20%) from the reference pitch (control)
    - Evaluated only in the voiced sections of the reference
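The gross error measure can be sketched as below. This is an illustration only: it assumes per-frame pitch values in Hz, that unvoiced frames are marked with 0 in both tracks, and that a voiced reference frame for which the tracker outputs 0 counts as an error. The function name and the sample values are hypothetical.

```python
def gross_error_percent(estimates_hz, reference_hz, tolerance=0.20):
    """Percentage of voiced reference frames with a gross pitch error.

    A frame is voiced when its reference pitch is above 0 (assumption).
    An error is counted when the estimate differs from the reference by
    more than tolerance times the reference.
    """
    voiced = [(est, ref) for est, ref in zip(estimates_hz, reference_hz)
              if ref > 0]
    if not voiced:
        return 0.0
    errors = sum(1 for est, ref in voiced if abs(est - ref) > tolerance * ref)
    return 100.0 * errors / len(voiced)

# Made-up frames: 0.0 marks an unvoiced frame.
reference = [0.0, 100.0, 110.0, 120.0, 0.0, 200.0]
estimates = [90.0, 105.0, 220.0, 118.0, 0.0, 0.0]

rate = gross_error_percent(estimates, reference)
print(f"gross error: {rate:.1f}%")
```

Of the four voiced reference frames, the doubled estimate (220 Hz against 110 Hz) and the missed frame (0 Hz against 200 Hz) exceed the 20% tolerance, while the estimate in the first frame is ignored because the reference there is unvoiced.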
Experiment 1 Results
 Individual performance of the proposed algorithm
Method           Control  Studio, Clean (%)  Studio, 5 dB Noise (%)  Telephone, Clean (%)  Telephone, 5 dB Noise (%)
YAAPT            C1       4.26               7.62                    8.14                  17.85
YAAPT*           C1       1.59               1.99                    2.69                  4.48
Spectral method  C1       4.23               4.45                    6.52                  6.95
NCCF             C1       3.58               4.52                    8.00                  16.61
YAAPT*: using control C1 for the spectral pitch track
NCCF: normalized cross-correlation function, used as the temporal method in YAAPT
Experiment 2 Results
 The results of the new method with various error thresholds
Error threshold  Control  Studio, Clean (%)  Studio, 5 dB Noise (%)  Telephone, Clean (%)  Telephone, 5 dB Noise (%)
10%              C1       5.46               7.31                    9.39                  16.14
10%              C2       4.18               6.06                    7.77                  14.78
20%              C1       2.90               3.65                    4.86                  7.45
20%              C2       1.56               2.16                    3.27                  5.85
40%              C1       2.25               2.44                    2.75                  3.63
40%              C2       0.91               1.06                    0.99                  2.05
Comparisons
Method           Control  Studio, Clean (%)  Studio, 5 dB Noise (%)  Telephone, Clean (%)  Telephone, 5 dB Noise (%)
Proposed method  C1       2.90               3.65                    4.86 (4.52*)          7.45 (5.90*)
DASH             C1       2.81               2.32                    3.73*                 4.15*
REPS             C1       2.68               2.98                    6.91*                 8.49*
YIN              C1       2.57               7.22                    7.55*                 14.6*
 DASH, REPS, YIN: results as reported in “Robust and
accurate fundamental frequency estimation ... ,” Nakatani et al.
 *: telephone speech simulated with an SRAEN filter
Conclusion
 A new pitch-tracking algorithm has been developed
that combines multiple information sources to
enable accurate, robust F0 tracking
 An analysis of errors indicates better performance,
for both high-quality and telephone speech, than
previously reported for pitch tracking
 Acknowledgements
    - This work was partially supported by JWFC 900