A further comparison of fundamental frequency tracking algorithms

Hongbing Hu, Peter Guzewich, Stephen Zahorian
Department of Electrical and Computer Engineering
Binghamton University, State University of New York

Outline
- Review of Yet Another Algorithm for Pitch Tracking (YAAPT)
- Modifications to YAAPT
- Experimental evaluations
- Speed/accuracy improvements
- Interpolations through unvoiced regions
- Tracking accuracy
- Tone classification for Mandarin
- Summary

YAAPT block diagram
[Figure: block diagram. The numbered stages are:]
- (1) Preprocessing: original speech is converted to filtered speech and filtered squared speech
- (2) F0 tracking on the FFT spectrum, producing the spectral F0 track
- (3) F0 candidate estimation followed by candidate refinement (applied in two parallel paths), producing refined F0 candidates
- (4) Dynamic programming, producing the final F0

YAAPT example
[Figure: four panels.]
- The filtered original speech signal (amplitude vs. time in seconds)
- Sorted pitch candidates, Cand.1 to Cand.3 (frequency in Hz vs. number of frames)
- Spectral pitch track and energy (NLFER) overlaid on the spectrogram of the nonlinear signal (frequency in Hz vs. time in seconds)
- Final pitch track overlaid on the spectrogram of the nonlinear signal (frequency in Hz vs. time in seconds)

Algorithm modifications
- Subject to user-specified thresholds, some F0 values are considered “outliers” and corrected
- Accuracy of the spectral pitch track improved
- Inefficient inner loop changed, giving about a 20% reduction in computation time
- Eliminated some potential errors caused by out-of-range input parameter settings

Algorithm modifications (continued)
- The biggest change is for the case when the final track is to be considered all voiced
- On the first pass, the “best” track is found with voicing decisions made
- The voiced regions are used to compute a smooth transition through unvoiced regions using third-order polynomial interpolation
- The original voiced sections are recombined with the interpolated values, with some additional smoothing
- The same parameter settings are now used for the best track with voicing decisions and the best all-voiced track

Comparison of pitch tracking: original vs. revised
[Figure: fundamental frequency (Hz) vs. time (s), with Original, Revised, and Reference tracks.]

Experimental evaluation
- Two databases
  - Keele database (5F, 5M, ~6 minutes, British English)
  - Japanese database (14F, 14M, ~40 minutes)
- Three pitch trackers
  - YAAPT
  - PRAAT
  - YIN
- Two error measures
  - GROSS error (based on voiced regions only)
  - BIG error (voiced region errors + voicing decision errors)
- Multiple bandwidths, noise types, and noise levels

Experimental results: Keele database

Gross error (%)
                  Studio                   Simulated telephone
Method    Clean    W-5     B-5      Clean    W-5     B-5
YAAPT      3.07    3.44    7.87      4.56    6.37   28.23
PRAAT      5.22    7.79   17.23     11.18   14.27   29.84
YIN        2.95    4.57   14.82     21.07   27.30   38.52

Big error (%)
                  Studio                   Simulated telephone
Method    Clean    W-5     B-5      Clean    W-5     B-5
YAAPT      6.14    8.06   21.71     14.04   16.82   43.80
PRAAT      8.72   19.90   34.20     15.34   21.29   47.45
YIN           -       -       -         -       -       -

Experimental results: Japanese database

Gross error (%)
                  Studio                   Simulated telephone
Method    Clean    W-5     B-5      Clean    W-5     B-5
YAAPT      1.83    2.87    3.96      4.43    7.27   24.37
PRAAT      4.08    5.93   15.44      6.38   11.20   28.72
YIN        1.69    2.92   13.03     14.46   20.96   34.51

Big error (%)
                  Studio                   Simulated telephone
Method    Clean    W-5     B-5      Clean    W-5     B-5
YAAPT      4.99    7.16   15.33     12.16   17.10   35.07
PRAAT      7.12   17.89   30.96     10.06   23.26   43.81
YIN           -       -       -         -       -       -

Tone recognition experiments
- Mandarin tones
  - 4 tones (high, rising, dipping, falling); the neutral tone is not considered
- Database
  - RASC863 (Shanghai region) database
- 5 feature sets (segment lengths: 100 ms to 800 ms)
  - DCTC/DCSC features (35 features, 5 DCTCs x 7 DCSCs)
  - Pitch alone (7 features, encoded with 7 DCSC terms)
  - Normalized pitch (7 features)
  - DCTC/DCSCs + pitch (42 features)
  - DCTC/DCSCs + normalized pitch (42 features)
- Neural network classifier

Automatic tone recognition
[Figure: three panels (YAAPT, PRAAT, YIN) showing accuracy (%) vs. segment length (100 ms, 200 ms, 400 ms, 800 ms) for the feature sets DCTC, P, NP, DCTC+P, and DCTC+NP.]

Automatic tone recognition: comments
- Spectral shape trajectory features give a significant improvement over using pitch-only features
- Pitch normalization (based on the mean and standard deviation of pitch for each sentence) gives large improvements for short segment lengths, but not for long segment lengths
- Pitch tracks computed from YAAPT are considerably more effective for tone recognition than tracks computed from YIN or PRAAT; see the next slide for a possible explanation

Comparison
[Figure: content not recoverable from the transcript.]

Summary
- YAAPT MATLAB function: [Pitch, frms, rate] = yaapt(Data, Fs, VU, ExtrPrm… )
- ExtrPrm can be used to adjust settings such as the minimum and maximum search ranges for pitch; default parameters were used for all cases tested here
- YAAPT code is available at www.ws.binghamton.edu/zahorian/yaapt.htm

Questions?

Backup

Spectral-temporal features
- Discrete Cosine Transform Coefficients (DCTCs)
  - Spectral features
  - A modified cosine transform of the log magnitude spectrum
  - Use frequency warping to simulate the nonlinearity of the human ear in speech perception (e.g., Mel scale)
- Discrete Cosine Series Coefficients (DCSCs)
  - Temporal features
  - A cosine series expansion over time using overlapped blocks of DCTCs
  - Capture the changes of each feature component from frame to frame

DCTC features
- DCTC computation: given the spectrum X with the frequency f normalized to a [0, 1] range, the ith DCTC is calculated as

$$\mathrm{DCTC}(i) = \int_{0}^{1} a\big(X(g(f))\big)\,\phi_i(f)\,df$$

- a(X): nonlinear amplitude scaling (log)
- g(f): nonlinear frequency warping (Mel-like function)
- Basis vector:

$$\phi_i(f) = \cos[\pi i\, g(f)]\,\frac{dg}{df}$$

[Figure: first 3 DCTC basis vectors (BV0, BV1, BV2); amplitude vs. frequency in kHz.]

DCSC features
- DCSC computation: represents the temporal evolution of the DCTCs

$$\mathrm{DCSC}(i, j) = \int_{0}^{1} \mathrm{DCTC}(i, h(t))\,\theta_j(t)\,dt$$

- h(t): time “warping” function, giving nonuniform time resolution
- Basis vectors:

$$\theta_j(t) = \cos[\pi j\, h(t)]\,\frac{dh}{dt}$$

[Figure: first 3 DCSC basis vectors (BV0, BV1, BV2); amplitude vs. time in ms.]
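
The DCTC definition in the backup slides (an integral over normalized frequency of the scaled spectrum times a warped cosine basis vector) can be approximated numerically. The Python sketch below is an illustration under stated assumptions, not the authors' implementation: the basis vector is assumed to have the form cos[π i g(f)] dg/df, the log-shaped warping g(f) with constant c = 5 is a hypothetical stand-in for the Mel-like function, and the function names and the midpoint integration rule are choices made here for clarity.

```python
import math


def warp(f, c=5.0):
    """Hypothetical Mel-like warping g(f) on [0, 1], with g(0) = 0 and g(1) = 1."""
    return math.log1p(c * f) / math.log1p(c)


def warp_derivative(f, c=5.0):
    """Analytic derivative dg/df of the hypothetical warping above."""
    return c / ((1.0 + c * f) * math.log1p(c))


def dctc_basis(i, f, c=5.0):
    """Assumed basis vector phi_i(f) = cos(pi * i * g(f)) * dg/df."""
    return math.cos(math.pi * i * warp(f, c)) * warp_derivative(f, c)


def dctc(i, scaled_spectrum, n_points=2000, c=5.0):
    """Midpoint-rule approximation of the integral over [0, 1] of
    a(X(g(f))) * phi_i(f) df.

    scaled_spectrum is a callable returning the already amplitude-scaled
    (e.g. log) spectrum at a normalized frequency in [0, 1].
    """
    step = 1.0 / n_points
    total = 0.0
    for k in range(n_points):
        f = (k + 0.5) * step  # midpoint of the k-th sub-interval
        total += scaled_spectrum(warp(f, c)) * dctc_basis(i, f, c)
    return total * step


def flat_spectrum(f):
    """Toy input: a constant scaled spectrum."""
    return 1.0


coeffs = [dctc(i, flat_spectrum) for i in range(3)]
```

For the flat toy spectrum, the change of variable u = g(f) turns the integral into the integral of cos(π i u) over [0, 1], so the zeroth coefficient comes out close to 1 and the higher ones close to 0. That makes a convenient sanity check before substituting a real log magnitude spectrum.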