Phillip Harrison - Acoustics Research Institute

Formant Measurement Errors
From Real Speech
Philip Harrison
J P French Associates & University of York
IAFPA 20th Annual Conference
24th – 28th July 2011 – Vienna
Outline
• Motivation & background
• Formant measurement errors from
synthetic speech
• Formant measurement errors from real
speech – VTR database
– Praat
– Praat tracker
– CAbS tracker
– Published results – MSR & WaveSurfer
• Discussion
Motivation & Background
• All measurements are subject to ‘error’
• An estimate of the error should accompany
all measurements
• Increasing use of formant measurements in
forensic casework – no errors quoted
• Significant problem – can’t obtain ‘ground
truth’ values from real speech to determine
error of formant measurement systems
• Potential solution – synthetic speech
Errors from Synthetic Speech
• Idealised synthetic male speaker
– 2,858 monophthongs over F1, F2 vowel space
– Specified F1 to F5 centre frequency & bandwidth
– Pulse train glottal source, range of F0s (70 – 190 Hz)
• Measured formants at different LPC orders (6 to
20) in Praat – Burg (LPC) analysis, not a tracker
• Calculation of error: F_error = F_measured – F_specified
• Analysis of errors for F1, F2 and F3 – error surface
plots, summary stats (including absolute error,
standard deviation – normal (Hz) & percentage)
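The error calculation and summary statistics above can be sketched in a few lines. This is a minimal illustration, not the analysis code used for the study: the function name and the example values are invented, and it assumes the standard deviation is taken over the absolute errors, which the slides do not state.

```python
from statistics import mean, stdev

def error_summary(measured, specified):
    """Summarise formant errors for paired measured/specified values in Hz."""
    # Signed error per vowel: measured minus specified
    signed = [m - s for m, s in zip(measured, specified)]
    absolute = [abs(e) for e in signed]
    # Absolute error as a percentage of the specified frequency
    percent = [100 * a / s for a, s in zip(absolute, specified)]
    return {
        "abs_mean_hz": mean(absolute),
        "abs_sd_hz": stdev(absolute),  # assumption: SD of the absolute errors
        "pct_mean": mean(percent),
        "pct_sd": stdev(percent),
    }

# Invented illustrative values, not data from the study
specified = [500.0, 1500.0, 2500.0]
measured = [510.0, 1485.0, 2500.0]
summary = error_summary(measured, specified)
```

Reporting the percentage alongside the value in Hz matters because the same error in Hz is a larger proportion of a low formant than of a high one.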
Error Summary Results –
Absolute Error @ F0 = 100 Hz
        F1 Abs (SD) Hz  F1 % (SD)   F2 Abs (SD) Hz  F2 % (SD)   F3 Abs (SD) Hz  F3 % (SD)
LPC 7   9.6 (10.4)      2.1 (2.3)   31.6 (19.4)     2.1 (1.2)   99.8 (36.9)     4.2 (1.4)
LPC 8   8.3 (8.8)       1.8 (1.9)   18.3 (11.6)     1.3 (0.8)   41.1 (17.9)     1.7 (0.7)
LPC 9   8.8 (8.7)       2.0 (2.0)   8.0 (9.2)       0.6 (0.7)   10.7 (8.3)      0.5 (0.3)
Multiple Synthetic Speakers
• Variation both within and between real
speakers in many speech production
parameters – e.g. F0 range, F1-F2 vowel
space, formant bandwidths
• Single synthetic speaker unlikely to be
representative or capture variation
• Consider multiple synthetic speakers:
– Alternative specified F3 values – 8 speakers
– Alternative glottal source signals – 10 speakers
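One way to picture a synthetic speaker with specified formant centre frequencies and bandwidths is a pulse train passed through a cascade of second-order resonators. The sketch below is a generic Klatt-style cascade written for illustration only; the slides do not say which synthesiser was used, and the function names, sample rate and formant values are all assumptions.

```python
import math

def resonator(signal, freq_hz, bw_hz, fs):
    """Second-order digital resonator with unity gain at 0 Hz (Klatt-style)."""
    t = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synth_vowel(formants, f0_hz=100.0, fs=10000, dur_s=0.2):
    """Pulse train at f0_hz filtered by one resonator per (frequency, bandwidth) pair."""
    period = round(fs / f0_hz)
    n = round(fs * dur_s)
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bw_hz in formants:
        signal = resonator(signal, freq_hz, bw_hz, fs)
    return signal

# Hypothetical speakers: only the specified F3 differs between them
base = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]
alt_f3 = [(500.0, 60.0), (1500.0, 90.0), (2800.0, 150.0)]
vowel_a = synth_vowel(base)
vowel_b = synth_vowel(alt_f3)
```

Swapping the pulse train for a different source waveform would give the alternative glottal source condition in the same framework.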
Multiple Synthetic Speakers –
Summary Results
• Alternative F3
– Negligible influence on F1, F2 errors
– Changes in F3 error surface – influenced by F3 surface
– F3 error dependent on location within F1, F2 space –
constant F3 speakers – high F1 & F2 -> larger F3 errors
• Glottal source signal
– Impact on error surfaces & performance – across all
formants – some better, some worse than baseline
– Localised regions with large errors – greater variation
in errors than baseline
Real Speech
• How do these results translate to real
speech?
• Can’t directly test real speech – reason for
using synthetic speech initially
• Compare overall performance of real and
synthetic speech…
VTR Database
• Database of hand-corrected vocal tract resonance
values (Deng et al 2006) – balanced subset of
TIMIT corpus – good quality digital recordings
• 516 sentences – 186 speakers (113 male, 73
female) – 61,000 vowel frames, 6,600 vowel
tokens
• Similar method to synthetic speakers but frame
by frame measurements and token means across
monophthongs & diphthongs
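The difference between frame-by-frame errors and token-mean errors can be illustrated as follows. This is a sketch under an assumption the slides do not confirm: that the token error is the difference between the mean measured value and the mean reference value over a token's frames. The function names and numbers are invented.

```python
from statistics import mean

def frame_errors(tokens):
    """Absolute error for every frame, pooled across tokens."""
    return [abs(m - r) for token in tokens for m, r in token]

def token_errors(tokens):
    """Absolute error between per-token means of measured and reference values."""
    errors = []
    for token in tokens:
        measured_mean = mean(m for m, _ in token)
        reference_mean = mean(r for _, r in token)
        errors.append(abs(measured_mean - reference_mean))
    return errors

# Each token is a list of (measured_hz, reference_hz) frame pairs; values are invented
tokens = [
    [(480.0, 500.0), (530.0, 500.0), (505.0, 500.0)],
    [(700.0, 720.0), (720.0, 720.0)],
]
per_frame = frame_errors(tokens)
per_token = token_errors(tokens)
```

Frame errors of opposite sign partly cancel when a token is averaged, so token-level errors computed this way cannot exceed the mean of that token's frame-level errors.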
VTR Results
               F1 LPC  F1 Abs (SD)  F1 % (SD)  F2 LPC  F2 Abs (SD)  F2 % (SD)  F3 LPC  F3 Abs (SD)  F3 % (SD)
Frame          15      86 (124)     17 (27)    11      201 (301)    13 (19)    10      217 (337)    9 (14)
Frame ♂        17      78 (108)     17 (25)    11      177 (259)    12 (18)    11      202 (306)    9 (13)
Frame ♀        14      94 (141)     17 (26)    10      225 (331)    13 (20)    10      228 (347)    9 (14)
Token          15      66 (89)      14 (18)    11      161 (220)    10 (13)    11      179 (249)    7 (9)
Comparison with Synthetic
Speech
               F1 LPC  F1 Abs (SD)  F1 % (SD)  F2 LPC  F2 Abs (SD)  F2 % (SD)  F3 LPC  F3 Abs (SD)  F3 % (SD)
Synth          8       8.3 (8.8)    1.8 (1.9)  9       8.0 (9.2)    0.6 (0.7)  9       10.7 (8.3)   0.5 (0.3)
Real           17      63 (84)      14 (19)    11      151 (205)    10 (14)    11      168 (235)    7 (10)
• Both speakers = male, monophthong token average
• Best performance of all real results shown
Can Results be Improved?
• Real speech results not as good as synthetic
speech
• But measurements so far made without any
‘intelligence’ in selection of values
• Praat standard formant measurement tool is not
a tracker
• Formant trackers attempt to select most likely
values based on criteria – bandwidth, centre
frequency, frame transitions
Trackers Tested
• Trackers
– Praat tracker – Viterbi algorithm, considers
centre frequency, bandwidth and frame
transitions
– CAbS tracker (Clermont et al 2007) – cepstral
compatibility between original signal and
candidate formants, plus continuity constraints
• ‘Default’ settings used
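The general idea of a Viterbi-based tracker, choosing one candidate per frame so that the total of local and transition costs is minimised, can be sketched for a single formant. This is a simplified illustration and not Praat's cost function: it ignores bandwidth, tracks one formant only, and the target frequency, weight and candidate values are invented.

```python
def viterbi_track(candidates, target_hz, jump_weight=1.0):
    """Pick one candidate per frame minimising local + transition cost.

    candidates: list of frames, each a non-empty list of candidate frequencies (Hz).
    Local cost: distance from target_hz. Transition cost: jump_weight * |jump|.
    """
    # cost[i] = best total cost of any path ending at candidate i of the current frame
    cost = [abs(c - target_hz) for c in candidates[0]]
    back = []  # back[t][i] = best predecessor index in frame t for candidate i of frame t+1
    for prev, cur in zip(candidates, candidates[1:]):
        new_cost, pointers = [], []
        for c in cur:
            totals = [cost[j] + jump_weight * abs(c - p) for j, p in enumerate(prev)]
            best = min(range(len(prev)), key=totals.__getitem__)
            new_cost.append(totals[best] + abs(c - target_hz))
            pointers.append(best)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate
    idx = min(range(len(cost)), key=cost.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [frame[i] for frame, i in zip(candidates, path)]

# Invented candidates: the third frame includes a low spurious candidate at 300 Hz
frames = [[500.0, 1500.0], [520.0, 1500.0], [300.0, 540.0], [560.0, 1600.0]]
track = viterbi_track(frames, target_hz=500.0)
```

A non-tracking analysis would simply take the candidates in frequency order in each frame, so a spurious low peak would be labelled as the formant; the transition cost is what lets a tracker reject it.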
Praat Tracker Results
               F1 LPC  F1 Abs (SD)  F1 % (SD)  F2 LPC  F2 Abs (SD)  F2 % (SD)  F3 LPC  F3 Abs (SD)  F3 % (SD)
Frame          15      86 (124)     17 (27)    11      201 (301)    13 (19)    10      217 (337)    9 (14)
Tracker Frame  15      55 (75)      11 (17)    11      94 (163)     7 (15)     14      179 (359)    8 (19)
Token          15      66 (89)      14 (18)    11      161 (220)    10 (13)    11      179 (249)    7 (9)
Tracker Token  15      46 (61)      10 (14)    11      81 (141)     6 (12)     14      162 (332)    8 (18)
CAbS Tracker Results
               F1 LPC  F1 Abs (SD)  F1 % (SD)  F2 LPC  F2 Abs (SD)  F2 % (SD)  F3 LPC  F3 Abs (SD)  F3 % (SD)
Frame          15      86 (124)     17 (27)    11      201 (301)    13 (19)    10      217 (337)    9 (14)
Tracker Frame  14      69 (139)     15 (34)    13      122 (239)    8 (18)     12      413 (544)    18 (25)
Token          15      66 (89)      14 (18)    11      161 (220)    10 (13)    11      179 (249)    7 (9)
Tracker Token  14      62 (132)     14 (33)    13      115 (218)    8 (16)     11      414 (512)    18 (23)
Tracker Comparison
Frame Data
               F1 LPC  F1 Abs (SD)  F1 % (SD)  F2 LPC  F2 Abs (SD)  F2 % (SD)  F3 LPC  F3 Abs (SD)  F3 % (SD)
Praat          15      86 (124)     17 (27)    11      201 (301)    13 (19)    10      217 (337)    9 (14)
Praat Tracker  15      55 (75)      11 (17)    11      94 (163)     7 (15)     14      179 (359)    8 (19)
CAbS Tracker   14      69 (139)     15 (34)    13      122 (239)    8 (18)     12      413 (544)    18 (25)
WaveSurfer     -       70           -          -       94           -          -       154          -
MSR            -       64           -          -       105          -          -       125          -
Discussion
• Even with a tracker real speech results not
as good as synthetic performance
• But VTR database not perfect
• Does allow comparison of trackers – no
obvious ‘winner’
• Even though best performance at different
LPC orders across F1, F2 & F3, results
similar enough to use same LPC order for
all formants
Further Questions…
• What is the variation across speakers and vowel
categories? Is it significant?
• What is the maximum achievable performance?
• Is 10% error a realistic estimate?
– Possibly test more diverse synthetic speech
• Is 10% error acceptable?
• What impact does this have on LRs and other
numerical analyses (LTFAs)?
• Are trackers accurate enough to be used
unattended on large corpora? How much manual
intervention is necessary?
Questions?
Thanks to Frantz Clermont, Peter French & Paul Foulkes
References:
Clermont, F., Harrison, P. & French, P. (2007) ‘Formant-pattern estimation guided by
cepstral compatibility’. Proceedings of IAFPA 2007 Annual Conference, Plymouth, UK.
Deng, L., Cui, X., Pruvenok, R., Huang, L., Momen, S., Chen, Y. & Alwan, A. (2006) ‘A database of
vocal tract resonance trajectories for research in speech processing’. Proceedings of
the IEEE International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), Toulouse, France, May 2006.