Separating simultaneous voices


Using Fo and vocal-tract length to
attend to one of two talkers.
Chris Darwin
University of Sussex
With thanks to :
• Rob Hukin
• John Culling
• John Bird
• MRC & EPSRC
1. Review past work on the way that the
human auditory system uses differences in
Fo to separate two voices;
2. Present new data on the use of Fo, vocal-tract
length and their combination to allow
listeners to select one of two simultaneous
messages.
Something old, something new, something borrowed, background blue.
Three types of experiment:
Difference in Fo leads to:
1. binaural separation of sound sources
2. increase in intelligibility
3. ability to track a sound source over time.
Broadbent & Ladefoged (1957)
• PAT-generated sentence “What did you say before that?” (figure labels: F1, F2)
• when Fo the same, 125 Hz (either natural or monotone), listeners heard:
– one voice only: 16/18
– in one place: 18/18
• when Fo different, 125/135 Hz (monotone), listeners heard:
– two voices: 15/18
– in two places: 12/18
B & L Conclusion
Common Fo integrates
– broadband frequency regions of a single voice
– coming simultaneously to different ears
into a single voice heard in one position.
Is a common Fo sufficient for
fusion?
• Broadbent & Ladefoged's stimuli used formant
resonators with broad low-frequency skirts.
• Sharply-filtered sounds sometimes give impression
of two sound sources even with common Fo.
[Figure: Formant T(f) & abs difference. y-axis: dB (-30 to 30); x-axis: frequency (0 to 2000).]
apologies to Hideki
Dichotic: same Fo
original -> PSOLA (Fo -> 0%) -> LP filter -> Left ear
original -> PSOLA (Fo -> 0%) -> HP filter -> Right ear
Dichotic: different Fo
original -> PSOLA (Fo -> -4%) -> LP filter -> Left ear
original -> PSOLA (Fo -> +4%) -> HP filter -> Right ear
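As a rough guide to the size of these shifts, a percentage change in Fo can be converted to semitones as 12 times the base-2 logarithm of the frequency ratio. The sketch below is illustrative only; the function name is hypothetical and does not come from the slides.

```python
import math

def percent_to_semitones(percent):
    """Convert a percentage change in Fo to semitones (12 per octave)."""
    return 12 * math.log2(1 + percent / 100)

up = percent_to_semitones(+4)    # Fo raised by 4%
down = percent_to_semitones(-4)  # Fo lowered by 4%
separation = up - down           # interval between the two shifted bands

print(f"+4%: {up:.2f} semitones, -4%: {down:.2f} semitones")
print(f"separation between the two ears: {separation:.2f} semitones")
```

On this reckoning the ±4% condition puts the two ears roughly 1.4 semitones apart.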
Complementary LP/HP filters
Variable bandwidth
[Figure: filter gain (0 to 1) vs frequency (0 to 2000); curves: 600 LP, 600 HP, 1400 LP, 1400 HP.]
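One simple way to picture complementary filters with a variable transition width is a pair of gain curves that always sum to 1. The sketch below assumes a linear cross-fade centred on the cut-off; the actual filter shapes used in the experiment are not given here, and the function names are hypothetical.

```python
def lp_gain(freq_hz, cutoff_hz, width_hz):
    """Low-pass gain: 1 below the transition band, 0 above, linear inside."""
    if width_hz <= 0:
        # Zero width degenerates to an ideal brick-wall filter.
        return 1.0 if freq_hz < cutoff_hz else 0.0
    lo = cutoff_hz - width_hz / 2
    hi = cutoff_hz + width_hz / 2
    if freq_hz <= lo:
        return 1.0
    if freq_hz >= hi:
        return 0.0
    return (hi - freq_hz) / width_hz

def hp_gain(freq_hz, cutoff_hz, width_hz):
    """High-pass gain, defined as the complement of the low-pass gain."""
    return 1.0 - lp_gain(freq_hz, cutoff_hz, width_hz)

# Example: 600 Hz cut-off with a 400 Hz wide transition band.
for f in (0, 400, 600, 800, 2000):
    lp = lp_gain(f, cutoff_hz=600, width_hz=400)
    hp = hp_gain(f, cutoff_hz=600, width_hz=400)
    print(f"{f:5d} Hz  LP={lp:.2f}  HP={hp:.2f}")
```

Widening `width_hz` increases the range of frequencies that reach both ears, which is the overlap the following slides vary.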
Dichotic Results (female voice)
Filter X-over @ 1 kHz
[Figure: % fused (0 to 100) vs filter transition width (0 to 2000); curves: Same Fo, HP High Fo, HP Low Fo.]
Higher filter cut-offs need wider bandwidths
Same Fo
[Figure: % fused (0 to 100) vs filter transition width (0 to 2000); curves by filter cut-off frequency: 600, 800, 1200, 1400, 2000.]
Low-frequency overlap
[Figure: level difference (dB, 0 to -40) vs frequency (Hz, 0 to 2000); curves by filter cut-off: 600, 800, 1200, 1400, 2000.]
cf natural ILDs higher for low frequency sounds
Summary (dichotic)
• Fusion at same Fo? Low-frequency overlap needed
• Fusion at different Fo (±4%)? No
But what about Fo’s ability to separate
different voices? (original B & L question)
DFo improves identification
• double vowels: improvement over by 1 semitone
• sentences: keep improving over larger differences
[Figure: % correct (0 to 100) vs Fo difference in semitones (0 to 12); curves: double vowels (Assmann & Summerfield, 200 ms), sentences (Brokx & Nooteboom).]
Mechanisms of DFo improvement
• A. Global: Across-formant grouping by Fo
(as originally conceived by B & L)
• B. Local: Better definition of individual
formants, especially F1, where harmonics
are resolved
At small ∆Fos B more important than A for
double vowels
(Culling & Darwin, JASA 1993).
Also true for sentences?
DFo between two sentences
(Bird & Darwin 1998; after Brokx & Nooteboom, 1982)
Two sentences (same talker)
• only voiced consonants
• (with very few stops)
Target sentence Fo = 140 Hz
Masking sentence = 140 Hz ± 0,1,2,5,10 semitones
Task: write down target sentence
40 Subjects
40 Sentence Pairs
Replicates & extends Brokx & Nooteboom
[Figure: % words recognised (0 to 100) vs Fo difference (semitones, 0 to 10); condition: Normal; annotation: Perfect Fourth ~4:3.]
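The masker Fos implied by these semitone steps follow from the equal-tempered relation f = Fo × 2^(n/12). A minimal sketch, with a hypothetical helper name:

```python
def shift_semitones(fo_hz, semitones):
    """Return fo_hz shifted by a number of equal-tempered semitones."""
    return fo_hz * 2 ** (semitones / 12)

target_fo = 140.0  # Hz, the target sentence Fo given on the slide

for n in (0, 1, 2, 5, 10):
    up = shift_semitones(target_fo, n)
    down = shift_semitones(target_fo, -n)
    ratio = up / target_fo
    print(f"{n:2d} semitones: {down:6.1f} Hz or {up:6.1f} Hz (ratio {ratio:.3f})")
```

At 5 semitones the ratio is about 1.335, close to the 4:3 of a perfect fourth. Applied to a 100 Hz base, the same steps round to 100, 106, 112, 133 and 178 Hz, which matches the Fo pairs listed for the chimeric sentences.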
Chimeric sentences
(Bird & Darwin, Grantham Meeting 1998)
[Figure: results by paired Fos 100-100, 100-106, 100-112, 100-133, 100-178; labels: Fo below 800 Hz, Fo above 800 Hz.]
Paired sentences' Fos (each cell: Fo of first sentence, Fo of second sentence)
Condition                         Low Pass    High Pass
Normal                            100, 112    100, 112
Same Fo in High                   100, 112    100, 100
Same Fo in Low                    100, 100    100, 112
Swapped (gives wrong grouping)    100, 112    112, 100
Segregating sentence pairs by Fo
• all the action is in the low frequency region (<800 Hz)
• no strong evidence of across-formant grouping
Adding Fo-swapped
• inappropriate pairing of Fo only detrimental above 4 semitones
Summary of Fo-differences
• Across-formant grouping only significant
for large Fo differences (> ~ 4 semitones)
• Most of the improvement with small Fo
differences happens in the F1 frequency region.
Another caveat for autocorrelation
• Improvement in identification of double
vowels for small ∆Fos is about as good when
each vowel is made up of alternating
harmonics of the two Fos (Culling & Darwin)
• Autocorrelation would pull out completely
wrong envelopes.
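To make the alternating-harmonics manipulation concrete, the sketch below lists which harmonic frequencies each vowel would contain if odd-numbered harmonics came from one Fo and even-numbered harmonics from the other. The Fo values, the odd/even assignment and the function name are assumptions for illustration, not details taken from Culling & Darwin.

```python
def alternating_harmonics(fo_a, fo_b, n_harmonics):
    """Build two component lists that each mix harmonics of both Fos.

    Vowel 1 takes odd harmonics of fo_a and even harmonics of fo_b;
    vowel 2 takes the remaining components.
    """
    vowel_1, vowel_2 = [], []
    for h in range(1, n_harmonics + 1):
        if h % 2 == 1:
            vowel_1.append(h * fo_a)
            vowel_2.append(h * fo_b)
        else:
            vowel_1.append(h * fo_b)
            vowel_2.append(h * fo_a)
    return vowel_1, vowel_2

# Hypothetical Fos about one semitone apart.
v1, v2 = alternating_harmonics(fo_a=100.0, fo_b=106.0, n_harmonics=6)
print("vowel 1 components (Hz):", v1)
print("vowel 2 components (Hz):", v2)
```

Neither vowel's components share a single Fo, so a scheme that groups components by common periodicity would collect half of each vowel into each group.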
No simultaneous effect of FM
• Although separation by Fo shows strong
effects, there is no detectable effect of
simultaneous separation by different
Frequency Modulations of Fo.
• Listeners unable to discriminate correlated
from uncorrelated FM in simultaneous
inharmonic sine waves (Carlyon).
Summary of DFo effects in separating
competing voices
• Intelligibility increased by small DFo only in
F1 region (and harmonic alternation tolerated)...
• … but not by DFo in only higher freq.
region.
• Across-formant consistency of Fo only
important at larger DFo
• FM produces no additional separation
CRM task (tracking a sound source)
(Bolia et al., 2000)
• 2 simultaneous sentences each of form
"Ready (Call Sign) go to (Color) (Number) now."
• Same talker (TT); Same Sex (TS); Different sex (TD)
• Target denoted by Call-Sign "Baron"
• 8 Talkers in corpus, 2048 tokens
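The token count can be checked by enumerating the sentence frame. The sketch below assumes the word lists from the published description of the CRM corpus (8 call signs, 4 colors, 8 numbers); the slides name only four of the call signs, so the full lists should be treated as an assumption here.

```python
from itertools import product

# Word lists assumed from the published corpus description, not from the slides.
CALL_SIGNS = ["Arrow", "Baron", "Charlie", "Eagle",
              "Hopper", "Laker", "Ringo", "Tiger"]
COLORS = ["blue", "green", "red", "white"]
NUMBERS = range(1, 9)
N_TALKERS = 8  # from the slide

phrases = [
    f"Ready {call} go to {color} {number} now."
    for call, color, number in product(CALL_SIGNS, COLORS, NUMBERS)
]

print(len(phrases))              # phrases per talker
print(len(phrases) * N_TALKERS)  # tokens in the whole corpus
print(phrases[0])
```

With these lists there are 256 phrases per talker, which over 8 talkers gives the 2048 tokens quoted above.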
CRM task
(Bolia et al., 2000)
Listeners responded by selecting the
appropriate colored digit with the
computer mouse
CRM task results (Brungart et al)
Effect of change in Fo
Fo contours for 2 individuals
[Figure: Fo contours (0 to 200) vs Time (s) (0 to 2); panels: Call Sign Arrow, Call Sign Tiger, Call Sign Baron, Call Sign Eagle.]
[Figure: scatter plot with fitted line y = -0.0262x + 0.4163, R2 = 0.9315; labels: Av change with Fo (-0.10 to 0.30), %Fo difference (0 to 20).]
Individuals with most constant Fo contours show most improvement with ∆Fo
Effect of change of VT
Effect of joint change of Fo and VT
Original: male
Effect of joint change of Fo and VT
Original: female
Superadditivity of ∆Fo and ∆VT
[Figure: actual d' (0.00 to 1.50) vs predicted d' (0.00 to 1.50); series: male, female.]
∆Fo & ∆VT superadditive
… and still less than real different-sex talkers
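The slides do not say how the predicted d' was derived from the single-cue results. As a hedged illustration, the sketch below compares a combined d' with two common candidate predictions: a simple sum, and the root-sum-of-squares rule often used for independent cues. All numbers are made up for the example and the function names are hypothetical.

```python
from math import sqrt

def predicted_additive(d_fo, d_vt):
    """Prediction if the two single-cue d' values simply add."""
    return d_fo + d_vt

def predicted_independent(d_fo, d_vt):
    """Prediction if the cues are independent and optimally combined."""
    return sqrt(d_fo ** 2 + d_vt ** 2)

# Made-up values for illustration only; these are not data from the talk.
d_fo, d_vt, d_both = 0.3, 0.4, 0.9

additive = predicted_additive(d_fo, d_vt)
independent = predicted_independent(d_fo, d_vt)
is_superadditive = d_both > max(additive, independent)

print(f"additive prediction:    {additive:.2f}")
print(f"independent prediction: {independent:.2f}")
print(f"actual combined d':     {d_both:.2f}")
print("superadditive" if is_superadditive else "not superadditive")
```

Points lying above the diagonal in a plot of actual against predicted d' correspond to `is_superadditive` being true under whichever prediction rule was used.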
Conclusions
• Same Fo not a sufficient condition for
dichotic fusion of complementarily filtered
speech.
• Intelligibility increase for small ∆Fo
confined to F1 region. Only across-formant
for larger ∆Fo.
• Fo & VT-size useful for tracking sources
across time. Superadditive.