A further comparison of fundamental
frequency tracking algorithms
Hongbing Hu, Peter Guzewich, Stephen Zahorian
Department of Electrical and Computer Engineering
SUNY -- Binghamton University
Electrical and Computer Engineering
Binghamton University, State University of New York
Outline

- Review of Yet Another Algorithm for Pitch Tracking (YAAPT)
- Modifications to YAAPT
  - Speed/accuracy improvements
  - Interpolation through unvoiced regions
- Experimental evaluations
  - Tracking accuracy
  - Tone classification for Mandarin
- Summary
YAAPT block diagram

[Figure: YAAPT block diagram. Preprocessing (1) turns the original speech into filtered speech and filtered squared speech. An FFT spectrum feeds F0 tracking (2), which produces the spectral F0 track. F0 candidate estimation (3) and candidate refinement (3) are applied to both the filtered speech and the filtered squared speech. Dynamic programming (4) selects the final F0 from the refined F0 candidates.]
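The dynamic programming stage (4) can be illustrated with a small sketch. The slides do not give the cost terms YAAPT uses, so the local cost (one minus a candidate's merit), the transition cost (relative frequency jump between frames), and the names `best_track` and `jump_weight` are assumptions made for illustration only.

```python
def best_track(candidates, merits, jump_weight=1.0):
    """Pick one F0 candidate per frame by dynamic programming.

    candidates[t][k] is the k-th F0 candidate (Hz, > 0) at frame t;
    merits[t][k] is its merit in [0, 1] (higher is better).
    Both cost terms below are illustrative assumptions, not YAAPT's own.
    """
    # cost[t][k]: lowest total cost of any path ending at candidate k of frame t.
    cost = [[1.0 - m for m in merits[0]]]
    back = [[-1] * len(candidates[0])]  # back-pointers for path recovery
    for t in range(1, len(candidates)):
        row_cost, row_back = [], []
        for k, f in enumerate(candidates[t]):
            best_prev, best_val = -1, float("inf")
            for p, f_prev in enumerate(candidates[t - 1]):
                # Assumed transition cost: relative frequency jump.
                jump = abs(f - f_prev) / max(f, f_prev)
                val = cost[t - 1][p] + jump_weight * jump
                if val < best_val:
                    best_prev, best_val = p, val
            # Assumed local cost: low merit is expensive.
            row_cost.append(best_val + (1.0 - merits[t][k]))
            row_back.append(best_prev)
        cost.append(row_cost)
        back.append(row_back)
    # Backtrack from the cheapest final state.
    k = min(range(len(cost[-1])), key=lambda idx: cost[-1][idx])
    track = [0.0] * len(candidates)
    for t in range(len(candidates) - 1, -1, -1):
        track[t] = candidates[t][k]
        k = back[t][k]
    return track
```

With these assumed costs, a candidate with the highest merit in a single frame can still be rejected when choosing it would force a large jump, which is the behavior that suppresses isolated octave errors.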
YAAPT example

[Figure: four panels. (1) The filtered original speech signal: amplitude vs. time (seconds). (2) Sorted pitch candidates (Cand.1, Cand.2, Cand.3): frequency (Hz) vs. number of frames. (3) Spectral pitch track and energy (NLFER) overlaid on the spectrogram of the nonlinear signal: frequency (Hz) vs. time (seconds). (4) Final pitch track overlaid on the spectrogram of the nonlinear signal: frequency (Hz) vs. time (seconds).]
Algorithm modifications

- Subject to user-specified thresholds, some F0 values are treated as “outliers” and corrected
- Accuracy of the spectral pitch track improved
- An inefficient inner loop was rewritten, reducing computation time by about 20%
- Eliminated some potential errors caused by out-of-range input parameter settings
Algorithm modifications (continued)

- The biggest change applies when the final track is to be treated as all voiced
  - On the first pass, the “best” track is found with voicing decisions made
  - The voiced regions are used to compute a smooth transition through the unvoiced regions, using third-order polynomial interpolation
  - The original voiced sections are recombined with the interpolated values, with some additional smoothing
  - The same parameter settings are now used for the best track with voicing decisions and for the best all-voiced track
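A minimal sketch of the interpolation step follows. It assumes unvoiced frames are marked with 0, fits a third-order (Lagrange) polynomial through the four voiced frames nearest each gap, and holds the nearest voiced value before the first and after the last voiced frame. The choice of support frames and the edge handling are assumptions, and the additional smoothing mentioned on the slide is not included.

```python
import bisect


def _lagrange(xs, ys, x):
    """Evaluate the polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def interpolate_unvoiced(f0):
    """Return an all-voiced track from f0, where 0 marks unvoiced frames."""
    voiced = [t for t, v in enumerate(f0) if v > 0]
    if len(voiced) < 4:
        raise ValueError("need at least four voiced frames for a cubic fit")
    out = [float(v) for v in f0]
    for t, v in enumerate(f0):
        if v > 0:
            continue  # keep the original voiced values
        if t < voiced[0]:
            out[t] = float(f0[voiced[0]])   # hold the first voiced value
        elif t > voiced[-1]:
            out[t] = float(f0[voiced[-1]])  # hold the last voiced value
        else:
            # Two voiced frames on each side where available (assumed choice).
            pos = bisect.bisect_left(voiced, t)
            start = min(max(pos - 2, 0), len(voiced) - 4)
            xs = voiced[start:start + 4]
            ys = [f0[x] for x in xs]
            out[t] = _lagrange(xs, ys, t)
    return out
```

For example, `interpolate_unvoiced([0, 100, 102, 0, 0, 108, 110, 0])` fills the two-frame gap with values on the line through the surrounding voiced frames, since a cubic fit reproduces a linear trend exactly.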

Comparison

[Figure: Pitch Tracking, Original vs Revised. Fundamental frequency (Hz, roughly 110 to 145) vs. time (s, roughly 24.5 to 26.5) for the original, revised, and reference tracks.]
Experimental Evaluation

- Two databases
  - Keele database (5 female and 5 male speakers, ~6 minutes, British English)
  - Japanese database (14 female and 14 male speakers, ~40 minutes)
- Three pitch trackers
  - YAAPT
  - PRAAT
  - YIN
- Two error measures
  - Gross error (based on voiced regions only)
  - Big error (voiced-region errors + voicing decision errors)
- Multiple bandwidths, noise types, and noise levels
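The two error measures can be sketched as below. The slides only state what each measure covers, so the 20% deviation tolerance, the normalization of the big error by the total number of frames, and the convention that 0 marks an unvoiced frame are assumptions here. The function name `pitch_errors` is hypothetical.

```python
def pitch_errors(reference, estimate, tolerance=0.2):
    """Return (gross_error_percent, big_error_percent).

    reference, estimate: per-frame F0 in Hz, with 0 marking unvoiced frames.
    tolerance: assumed relative deviation that counts as a pitch error.
    """
    if len(reference) != len(estimate):
        raise ValueError("tracks must have the same number of frames")
    ref_voiced = 0      # frames voiced in the reference
    gross_count = 0     # reference-voiced frames with a large deviation
    pitch_count = 0     # voiced in both tracks, but large deviation
    voicing_count = 0   # voiced/unvoiced decision differs
    for ref, est in zip(reference, estimate):
        if ref > 0:
            ref_voiced += 1
            off = abs(est - ref) > tolerance * ref
            if off:
                gross_count += 1
            if est <= 0:
                voicing_count += 1   # voiced frame labeled unvoiced
            elif off:
                pitch_count += 1
        elif est > 0:
            voicing_count += 1       # unvoiced frame labeled voiced
    if ref_voiced == 0:
        raise ValueError("reference has no voiced frames")
    gross = 100.0 * gross_count / ref_voiced
    big = 100.0 * (pitch_count + voicing_count) / len(reference)
    return gross, big
```

The gross error is most meaningful when `estimate` is an all-voiced track, because any reference-voiced frame where the estimate is 0 is counted as a deviation. That is consistent with YIN, which has no big error entries in the results tables, being scored on gross error only.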
Experimental results--Keele database

Gross Error
                  Studio                  Simulated telephone
Method     Clean    W-5     B-5        Clean    W-5     B-5
YAAPT       3.07    3.44    7.87        4.56    6.37   28.23
PRAAT       5.22    7.79   17.23       11.18   14.27   29.84
YIN         2.95    4.57   14.82       21.07   27.30   38.52

Big Error
                  Studio                  Simulated telephone
Method     Clean    W-5     B-5        Clean    W-5     B-5
YAAPT       6.14    8.06   21.71       14.04   16.82   43.80
PRAAT       8.72   19.90   34.20       15.34   21.29   47.45
YIN            -       -       -           -       -       -
Experimental results--Japanese database

Gross Error
                  Studio                  Simulated telephone
Method     Clean    W-5     B-5        Clean    W-5     B-5
YAAPT       1.83    2.87    3.96        4.43    7.27   24.37
PRAAT       4.08    5.93   15.44        6.38   11.20   28.72
YIN         1.69    2.92   13.03       14.46   20.96   34.51

Big Error
                  Studio                  Simulated telephone
Method     Clean    W-5     B-5        Clean    W-5     B-5
YAAPT       4.99    7.16   15.33       12.16   17.10   35.07
PRAAT       7.12   17.89   30.96       10.06   23.26   43.81
YIN            -       -       -           -       -       -
Tone recognition experiments

- Mandarin tones
  - 4 tones (high, rising, dipping, falling); the neutral tone is not considered
- Database
  - RASC863 (Shanghai region) database
- 5 feature sets (segment lengths: 100 ms to 800 ms)
  - DCTC/DCSC features (35 features, 5 DCTCs x 7 DCSCs)
  - Pitch alone (7 features, encoded with 7 DCSC terms)
  - Normalized pitch (7 features)
  - DCTC/DCSCs + pitch (42 features)
  - DCTC/DCSCs + normalized pitch (42 features)
- Neural network classifier
Automatic Tone Recognition

[Figure: three panels (YAAPT, PRAAT, YIN) plotting accuracy (%, 50 to 80) against segment length (100 ms, 200 ms, 400 ms, 800 ms) for five feature sets: DCTC, P, NP, DCTC+P, DCTC+NP.]
Automatic tone recognition--comments

- Spectral shape trajectory features give a significant improvement over pitch-only features
- Pitch normalization (based on the mean and standard deviation of pitch for each sentence) gives large improvements for short segment lengths, but not for long segment lengths
- Pitch tracks computed with YAAPT are considerably more effective for tone recognition than tracks computed with YIN or PRAAT (see next slide for a possible explanation)
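One plausible reading of the normalization described above is a per-sentence z-score. The slide does not give the exact formula, so the sketch below is an assumption. It expects an all-voiced pitch track for one sentence, and the function name `normalize_pitch` is hypothetical.

```python
import statistics


def normalize_pitch(f0):
    """Z-score an all-voiced pitch track (Hz) using its own sentence statistics."""
    mean = statistics.mean(f0)
    std = statistics.pstdev(f0)  # population standard deviation of the sentence
    if std == 0:
        # A perfectly flat track carries no relative pitch movement.
        return [0.0 for _ in f0]
    return [(v - mean) / std for v in f0]
```

After this step a track has zero mean and unit standard deviation, so speakers with different pitch ranges become directly comparable.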
Comparison
Summary

- YAAPT MATLAB function:
  [Pitch, frms, rate] = yaapt(Data, Fs, VU, ExtrPrm… )
- ExtrPrm can be used to adjust settings such as the minimum and maximum search ranges for pitch; default parameters were used for all cases tested here
- YAAPT code is available at www.ws.binghamton.edu/zahorian/yaapt.htm
Questions?
Backup
Spectral-Temporal Features

- Discrete Cosine Transform Coefficients (DCTCs)
  - Spectral features
  - A modified cosine transform of the log magnitude spectrum
  - Uses frequency warping to simulate the nonlinearity of the human ear in speech perception (e.g., Mel scale)
- Discrete Cosine Series Coefficients (DCSCs)
  - Temporal features
  - A cosine series expansion over time using overlapped blocks of DCTCs
  - Captures the changes of each feature component from frame to frame
DCTC Features

- DCTC computation
  - Given the spectrum X with the frequency f normalized to a [0, 1] range, the ith DCTC is calculated as:

    $$\mathrm{DCTC}(i) = \int_0^1 a\big(X(g(f))\big)\,\phi_i(f)\,df$$

  - a(X): nonlinear amplitude scaling (log)
  - g(f): nonlinear frequency warping (Mel-like function)
  - Basis vectors:

    $$\phi_i(f) = \cos[\pi i\, g(f)]\,\frac{dg}{df}$$

[Figure: first 3 DCTC basis vectors (BV0, BV1, BV2); amplitude vs. frequency (0 to 8 kHz).]
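The DCTC integral can be approximated numerically, as in the sketch below. The slides only say the warping g(f) is Mel-like, so the logarithmic function used here, its constant `k`, the midpoint-rule integration, and the names `warp`, `warp_derivative`, and `dctc` are assumptions for illustration.

```python
import math


def warp(f, k=5.0):
    """Assumed Mel-like warping g(f) on [0, 1], with g(0) = 0 and g(1) = 1."""
    return math.log1p(k * f) / math.log1p(k)


def warp_derivative(f, k=5.0):
    """dg/df for the assumed warping above."""
    return k / ((1.0 + k * f) * math.log1p(k))


def dctc(spectrum, num_coeffs, num_points=2000, k=5.0):
    """Approximate DCTC(i) for i = 0 .. num_coeffs - 1 with the midpoint rule.

    spectrum: callable giving a positive magnitude at a normalized
    frequency in [0, 1]; a(X) is taken to be the natural log.
    """
    step = 1.0 / num_points
    coeffs = []
    for i in range(num_coeffs):
        total = 0.0
        for n in range(num_points):
            f = (n + 0.5) * step
            g = warp(f, k)
            # Basis vector: cos(pi * i * g(f)) * dg/df
            basis = math.cos(math.pi * i * g) * warp_derivative(f, k)
            total += math.log(spectrum(g)) * basis * step
        coeffs.append(total)
    return coeffs
```

A flat log spectrum is a useful sanity check: only DCTC(0) is nonzero, because the higher basis vectors integrate to zero over the warped frequency axis.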
DCSC Features

- DCSC computation
  - Represent the evolution of the DCTCs over time:

    $$\mathrm{DCSC}(i, j) = \int_0^1 \mathrm{DCTC}(i, h(t))\,\theta_j(t)\,dt$$

  - h(t): time “warping” function, giving nonuniform time resolution
  - Basis vectors:

    $$\theta_j(t) = \cos[\pi j\, h(t)]\,\frac{dh}{dt}$$

[Figure: first 3 DCSC basis vectors (BV0, BV1, BV2); amplitude vs. time (-60 to 60 ms).]
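The DCSC expansion can be approximated in the same way for a single DCTC trajectory. The slides do not specify h(t), so the raised-cosine warping below (which gives the center of the block finer resolution than its edges) is an assumption, as are the midpoint-rule integration and the names `time_warp`, `time_warp_derivative`, and `dcsc`.

```python
import math


def time_warp(t):
    """Assumed warping h(t) on [0, 1]: steepest at the block center."""
    return 0.5 * (1.0 - math.cos(math.pi * t))


def time_warp_derivative(t):
    """dh/dt for the assumed warping above."""
    return 0.5 * math.pi * math.sin(math.pi * t)


def dcsc(trajectory, num_terms, num_points=2000):
    """Approximate DCSC(i, j) for j = 0 .. num_terms - 1 for one DCTC index i.

    trajectory: callable giving the value of that DCTC at a normalized
    time in [0, 1] within the block.
    """
    step = 1.0 / num_points
    terms = []
    for j in range(num_terms):
        total = 0.0
        for n in range(num_points):
            t = (n + 0.5) * step
            h = time_warp(t)
            # Basis vector: cos(pi * j * h(t)) * dh/dt
            basis = math.cos(math.pi * j * h) * time_warp_derivative(t)
            total += trajectory(h) * basis * step
        terms.append(total)
    return terms
```

A trajectory that is constant over the block yields a nonzero value only for j = 0. The higher terms respond only to change over time, which matches the stated role of DCSCs as temporal features.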