Glimpsing Speech - University of Sheffield

A glimpsing model of
speech perception
Martin Cooke & Sarah Simpson
Speech and Hearing Research
Department of Computer Science
University of Sheffield
http://www.dcs.shef.ac.uk/~martin
Motivation:
The nonstationarity ‘paradox’
speech technology
performance falls with the
nonstationarity of the noise
background …
Aurora eval
Simpson & Cooke (2003)
… while listeners appear to prefer
a nonstationary background
(8-12 dB SRT gain)
Miller (1947)
Possible factors
In a 1-speaker background, listeners can …
• … employ organisational cues from the
background source to help segregate foreground
• … employ schemas for both foreground and
background
• … benefit from better glimpses of the speech target
but: multi-speaker backgrounds have certain advantages …
• … less chance of informational masking
• … easier enhancement algorithm
Glimpsing opportunities
Spectro-temporal glimpse densities
% of time-frequency
regions with a
locally-positive SNR
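As a rough illustration of the glimpse density measure, the sketch below counts the percentage of time-frequency cells whose local SNR exceeds a threshold. It assumes the target and masker energies are available separately before mixing; the function names and the toy energies are hypothetical and are not data from the study.

```python
import math

def local_snr_db(target_energy, masker_energy, floor=1e-12):
    """Local SNR in dB for one time-frequency cell (linear energies)."""
    return 10.0 * math.log10(max(target_energy, floor) / max(masker_energy, floor))

def glimpse_density(target, masker, threshold_db=0.0):
    """Percentage of time-frequency cells whose local SNR exceeds threshold_db.

    target and masker are equal-sized 2-D lists (channels x frames) of
    linear energies, assumed to be known separately before mixing.
    """
    total = 0
    glimpsed = 0
    for t_row, m_row in zip(target, masker):
        for t, m in zip(t_row, m_row):
            total += 1
            if local_snr_db(t, m) > threshold_db:
                glimpsed += 1
    return 100.0 * glimpsed / total

# Toy 3-channel x 4-frame energies (made-up numbers, not data from the study).
target = [[4.0, 1.0, 0.5, 2.0],
          [0.1, 3.0, 3.0, 0.2],
          [1.0, 1.0, 0.1, 5.0]]
masker = [[1.0, 2.0, 1.0, 1.0],
          [1.0, 1.0, 4.0, 1.0],
          [0.5, 2.0, 1.0, 1.0]]

density = glimpse_density(target, masker)
print(f"glimpse density at 0 dB: {density:.1f}%")
```

Lowering `threshold_db` admits more cells, which is one way to explore how the density changes with the detection criterion.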
Glimpsing
Informal definition
a glimpse is some time-frequency region which contains a
reasonably undistorted ‘view’ of local signal properties
Precursors
• Term used by Miller & Licklider (1950) to explain intelligibility of
interrupted speech
• Related to ‘multiple looks’ model of Viemeister & Wakefield (1991) which
demonstrated ‘intelligent’ temporal integration of tone bursts
• Assmann & Summerfield (in press) suggest ‘glimpsing & tracking’ as
way of understanding how listeners cope with adverse conditions
• Culling & Darwin (1994) developed a glimpsing model to explain double
vowel identification for small ΔF0s
• de Cheveigné & Kawahara (1999) can be considered a glimpsing model
of vowel identification
• Close relation to missing data processing (Cooke et al, 1994)
Types of glimpses
• Comodulated, e.g. Miller & Licklider (1950)
• Spectral, e.g. Warren et al (1995)
• General uncomodulated, e.g. Howard-Jones & Rosen (1993), Buss et al (2003)
Evidence from distorted speech
e.g. Drullman (1995) filtered noisy speech into 24 ¼-octave bands, extracted
the temporal envelope in each band, and replaced those parts of the envelope
below a target level with a constant value. Intelligibility was 60% when
98% of the signal was missing
Glimpsing in natural conditions:
the dominance effect
Although audio signals combine additively,
the occlusion metaphor is more appropriate
because of log-like compression in the
auditory system
Consequently, most regions in a mixture
are dominated by one or other source,
leaving very few ambiguous regions,
even for a pair of speech signals mixed
at 0 dB.
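A small worked calculation can make the dominance effect concrete. The sketch below assumes two uncorrelated sources whose energies add within a time-frequency region, and reports how far the mixture level (in dB) exceeds the louder source alone. The 60 dB reference level and the function names are illustrative assumptions.

```python
import math

def to_db(energy):
    """Convert a linear energy to decibels."""
    return 10.0 * math.log10(energy)

def mixture_excess_db(level_a_db, level_b_db):
    """Amount (dB) by which the mixture level exceeds the louder source.

    Assumes the two sources are uncorrelated, so their energies add.
    """
    mix = 10.0 ** (level_a_db / 10.0) + 10.0 ** (level_b_db / 10.0)
    return to_db(mix) - max(level_a_db, level_b_db)

for diff in (0, 3, 6, 10, 20):
    excess = mixture_excess_db(60.0, 60.0 - diff)
    print(f"level difference {diff:2d} dB -> mixture exceeds louder source by {excess:.2f} dB")
```

Under these assumptions the excess is at most about 3 dB (equal levels) and shrinks quickly as the level difference grows, so on a log scale the mixture is close to the maximum of the two sources in most regions.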
Issues for a glimpsing model
What constitutes a useful glimpse?
Glimpse detection
Is sufficient information contained in
glimpses?
How do listeners detect glimpses?
How can they be integrated?
Glimpse integration
Glimpsing study
Aims
– Determine if glimpses contain sufficient information
– Explore definition of useful glimpse
• Comparison between listeners and model using natural VCV stimuli
• Subset of Shannon et al (1999) corpus
V = /a/
C = { b, d, g, p, t, k, m, n, l, r, f, v, s, z, sh, ch }
• Background source
– reversed multispeaker babble for N = 1, 8
– Allows variation in glimpsing opportunities
– 3 SNRs (TMRs): 0, -6 and -12 dB
• 12 listeners heard 160 tokens in each condition
– 2 repeats × 16 VCVs × 5 male speakers
Identification results
[Figure panels: 1-speaker and 8-speaker backgrounds]
Glimpsing model
• CDHMM employing missing data techniques
• 16 whole-word HMMs
– 8 states
– 4 component Gaussian mixture per state
• Input representation
– 10 ms frames of modelled auditory excitation pattern (40
gammatone filters, Hilbert envelope, 8 ms smoothing)
– NB: only simultaneous masking is modelled
• Training
– 8 repetitions of each VCV by 5 male speakers per model
• Testing
– As for listeners viz. 2 repetitions of each VCV by 5 male speakers
– Performance in clean: > 99%
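One standard missing data technique is marginalisation: the state likelihood is computed from the reliable (glimpsed) channels only, and unreliable channels are ignored. The sketch below shows this for a single frame under a diagonal-covariance Gaussian mixture. The slides do not say which variant was used (for example, bounded marginalisation), so the diagonal covariance, the function names and the toy parameters are all assumptions.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def marginal_log_likelihood(obs, mask, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM,
    using only the channels flagged reliable in mask (marginalisation)."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, reliable, m, v in zip(obs, mask, mu, var):
            if reliable:  # unreliable channels integrate out to 1
                ll += log_gauss(x, m, v)
        component_logs.append(ll)
    # Log-sum-exp over mixture components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))

# Toy 2-component, 3-channel mixture (made-up parameters).
weights = [0.6, 0.4]
means = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
variances = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]

obs = [1.2, 9.9, 2.8]        # channel 1 is dominated by the masker
mask = [True, False, True]   # so it is flagged unreliable

print(marginal_log_likelihood(obs, mask, weights, means, variances))
```

Because the masked channel is skipped, its value has no influence on the score, which is the property that lets a recogniser work from glimpses alone.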
Model performance
I: ideal glimpses
Ideal glimpses
• All time-frequency regions whose
local SNR exceeds a threshold
• Optimum threshold = 0 dB
• For this task, there is more than
sufficient information in the
glimpsed regions
• Listeners perform suboptimally
with respect to this glimpse
definition
[Figure panels: 1-speaker and 8-speaker backgrounds]
Model performance:
variation in detection threshold
Q: Can varying the local SNR
threshold for glimpse detection
produce a better match?
• No choice of local SNR threshold
provides good fit to listeners
• Closest fit shown (-6 dB)
[Figure panels: 1-speaker and 8-speaker backgrounds]
Analysis
• Unreasonable to expect listeners to detect individual glimpses in
a sea of noise unless glimpse region is large enough
Model performance:
useable glimpses
• Definition: glimpsed region
must occupy at least N ERBs
and T ms
• Search over 1-15 ERBs, 10-100 ms, at various detection
thresholds
• Best match at
– 6.3 ERBs (9 channels)
– 40 ms
– 0 dB local SNR threshold
[Figure panels: 1-speaker and 8-speaker backgrounds]
• Howard-Jones & Rosen (1993) suggested 2-4 bands limit for
uncomodulated glimpsing
• Buss et al (2003) found evidence for uncomodulated glimpsing in up to
9 bands
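The useable-glimpse definition can be sketched as a filter over connected regions of a glimpse mask. The version below assumes 4-connected regions and measures region size by its bounding box in channels and frames; both choices, and the function name, are assumptions, since the slides give only the minimum extent in ERBs and ms. With 10 ms frames, the best-match values would correspond to `min_channels=9` and `min_frames=4`.

```python
def useable_glimpses(mask, min_channels, min_frames):
    """Keep only connected glimpse regions spanning enough channels and frames.

    mask is a 2-D list (channels x frames) of 0/1 or booleans. Regions are
    4-connected; size is measured by the bounding box (an assumption).
    """
    n_ch, n_fr = len(mask), len(mask[0])
    seen = [[False] * n_fr for _ in range(n_ch)]
    kept = [[False] * n_fr for _ in range(n_ch)]
    for c0 in range(n_ch):
        for f0 in range(n_fr):
            if not mask[c0][f0] or seen[c0][f0]:
                continue
            # Flood fill to collect one connected region.
            stack, region = [(c0, f0)], []
            seen[c0][f0] = True
            while stack:
                c, f = stack.pop()
                region.append((c, f))
                for dc, df in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nc, nf = c + dc, f + df
                    inside = 0 <= nc < n_ch and 0 <= nf < n_fr
                    if inside and mask[nc][nf] and not seen[nc][nf]:
                        seen[nc][nf] = True
                        stack.append((nc, nf))
            chans = [c for c, _ in region]
            frames = [f for _, f in region]
            tall_enough = max(chans) - min(chans) + 1 >= min_channels
            long_enough = max(frames) - min(frames) + 1 >= min_frames
            if tall_enough and long_enough:
                for c, f in region:
                    kept[c][f] = True
    return kept

# Toy mask: one 2x2 region plus two isolated cells (made-up example).
mask = [[1, 1, 0, 0, 1],
        [1, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0]]

kept = useable_glimpses(mask, min_channels=2, min_frames=2)
for row in kept:
    print("".join("#" if cell else "." for cell in row))
```

In the toy mask only the 2x2 block survives; the isolated cells are discarded as too small to count as useable glimpses.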
Consonant identification
Identification of individual consonants (% correct), listeners and model

            b   p   d   t   g   k   l   r   m   n   s  sh  ch   v   f   z  all
listeners  45  61  64  83  76  70  50  72  68  77  90  91  79  54  63  92   71
model      85  68  73  83  85  85  58  67  60  78  35  73  87  60  78  53   71

• Reasonable matches overall apart from b, s & z
• However, little token-by-token agreement between common
listener errors and model errors. Why?
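The per-consonant scores on this slide can be compared directly with a few lines of arithmetic. The sketch below takes the listener and model percentages as transcribed from the slide and ranks consonants by the size of the model-listener gap; the variable names are illustrative.

```python
consonants = ["b", "p", "d", "t", "g", "k", "l", "r",
              "m", "n", "s", "sh", "ch", "v", "f", "z"]
listeners = [45, 61, 64, 83, 76, 70, 50, 72, 68, 77, 90, 91, 79, 54, 63, 92]
model = [85, 68, 73, 83, 85, 85, 58, 67, 60, 78, 35, 73, 87, 60, 78, 53]

# Signed difference in percentage points (model minus listeners).
diffs = {c: m - l for c, l, m in zip(consonants, listeners, model)}

# Consonants ordered from largest to smallest absolute discrepancy.
ranked = sorted(diffs, key=lambda c: abs(diffs[c]), reverse=True)

for c in ranked[:3]:
    print(f"{c}: model - listeners = {diffs[c]:+d} points")

print(f"mean listeners: {sum(listeners) / len(listeners):.1f}%")
print(f"mean model:     {sum(model) / len(model):.1f}%")
```

The three largest gaps are for s, b and z, in line with the slide's remark, while the means over all 16 consonants are nearly identical.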
Factors
[Diagram: factors bearing on successful identification]
• Audibility of target
• Energetic masking
• ‘Confusability’
• Informational masking
• Organisational cues in target
• Organisational cues in background
• Existence of schemas for target
• Existence of schemas for background
Measuring energetic masking
Approach: resynthesise glimpses alone
• Filter, time-reverse, refilter to
remove phase distortion
• Select regions based on local
SNR mask
Results
• Little difference for 1-speaker
background, suggesting
relatively low contribution of informational
masking in this case (due to
reversed masker?)
• Larger difference for 8-speaker
case possibly due to ‘unrealistic’
glimpses
[Figure: % correct identification against target-to-masker ratio (-12, -6, 0 dB) for 1-speaker and 8-speaker backgrounds; conditions: glimpses alone, speech+noise]
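The filter, time-reverse, refilter step is a forward-backward filtering scheme: running a causal filter over the signal, reversing the result, filtering again and reversing back cancels the filter's phase response. The sketch below demonstrates the idea with a simple one-pole low-pass filter standing in for whatever filterbank was actually used in the resynthesis; the filter, its coefficient and the function names are assumptions.

```python
def one_pole_lowpass(signal, alpha=0.5):
    """Causal one-pole low-pass: y[n] = alpha*x[n] + (1 - alpha)*y[n-1]."""
    out, prev = [], 0.0
    for x in signal:
        prev = alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out

def zero_phase(signal, alpha=0.5):
    """Filter, time-reverse, refilter, then reverse back."""
    forward = one_pole_lowpass(signal, alpha)
    backward = one_pole_lowpass(forward[::-1], alpha)
    return backward[::-1]

# A unit impulse in the middle of a short buffer.
impulse = [0.0] * 41
impulse[20] = 1.0

causal = one_pole_lowpass(impulse)   # response trails after the impulse only
symmetric = zero_phase(impulse)      # response is symmetric about the impulse

for k in (1, 2, 3):
    print(f"k={k}: causal {causal[20 - k]:.4f} / {causal[20 + k]:.4f}, "
          f"zero-phase {symmetric[20 - k]:.4f} / {symmetric[20 + k]:.4f}")
```

The single-pass response is one-sided, whereas the forward-backward response is (to within edge truncation) symmetric about the impulse, which is what the absence of phase distortion looks like in the time domain.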
Comparison with ideal model
Results
• Ideal model performs well in excess
of listeners when supplied with
precisely the same information
Possible reasons:
• Distortions
• Glimpses do not occur in isolation:
possibility that a noise background
will help
• Lack of nonsimultaneous masking
model will inflate model
performance
[Figure: % correct identification against target-to-masker ratio (-12, -6, 0 dB); curves: Ideal (model), Ideal? (listeners)]
The glimpse decoder
• Attempt at a unifying statistical theory for primitive and model-driven
processes in CASA
• Basic idea: decoder not only determines the most likely speech
hypothesis but also decides which glimpses to use
– Key advantage: no longer need to rely on clean acoustics!
• Can interpret (some) informational masking effects as the incorrect
assignment of glimpses during signal interpretation
• Barker, J, Cooke, M.P. & Ellis, D.P.W. “Decoding speech in the presence
of other sources”, accepted for Speech Communication
Summary & outlook
• Proposed a glimpsing model of speech identification in noise
• Demonstrated sufficiency of information in target glimpses, at
least for VCV task
• Preliminary definition of useful glimpse gives good overall
model-listener match
• Introduced 2 procedures for measuring the amount of energetic
masking (i) via ASR (ii) via glimpse resynthesis
• Need nonsimultaneous masking model
• Need to isolate effects due to schemas
• Repeat using non-reversed speech to introduce more
informational masking
• Need to quantify effect of distortion in glimpse resynthesis
• …
Masking noise can be beneficial
Warren et al (1995) demonstrated a spectral induction effect with 2 narrow
bands of speech with intervening noise
[Figure: keywords correct (%) against noise level relative to speech (-Inf, -40, -20, -10, 0 dB); labels: CF 633 Hz, CF 4200 Hz, fullband]
Cooke & Cunningham (in prep) Spectral induction with
single speech-bands.
Speech modulated noise
• As in Brungart (2001)
• Model results and glimpse
distributions indicate increase in
energetic masking for this type of
masker
[Figure panels: natural speech (1 spkr, 8 spkr); speech modulated noise, SMN (1 spkr, 8 spkr)]
Speech modulated noise
• Listeners perform better with
SMN than predicted on the
basis of reduced glimpses (cf
SMN model), but not quite as
well as they do with natural
speech masker
• Suggests energetic masking
is not the whole story (cf
Brungart, 2001), but further
work needed to quantify
relative contribution of
– Release from IM
– Absence of background
models/cues
[Figure panels: 1-speaker and 8-speaker backgrounds; curves: NAT (model), NAT (listeners), SMN (listeners), SMN (model)]