CS 224S / LINGUIST 285
Spoken Language Processing
Dan Jurafsky
Stanford University
Spring 2014
Lecture 15: Speaker Recognition
Lots of slides thanks to Douglas Reynolds
Why speaker recognition?
Access Control
physical facilities
websites, computer networks
Transaction Authentication
telephone banking
remote credit card purchases
Law Enforcement
forensics
surveillance
Speech Data Mining
meeting summarization
lecture transcription
slide text from Douglas Reynolds
Speaker Recognition Tasks
Three tasks:
Identification: Whose voice is this?
Verification/Authentication/Detection: Is this Bob's voice?
Segmentation and Clustering (Diarization): Where are speaker changes? (Speaker A vs. Speaker B) Which segments are from the same speaker?
slide from Douglas Reynolds
Two kinds of speaker verification
Text-dependent
Users have to say something specific
easier for system
Text-independent
Users can say whatever they want
more flexible but harder
Phases of Speaker Detection System
Two distinct phases to any speaker verification system
Enrollment Phase: enrollment speech from each speaker (e.g., Bob, Sally) → feature extraction → model training → a model for each speaker
Detection Phase: hypothesized identity (e.g., Sally) → feature extraction → detection decision → Detected!
slide from Douglas Reynolds, MIT Lincoln Laboratory
Detection: Likelihood Ratio
Two-class hypothesis test:
H0: X is not from the hypothesized speaker
H1: X is from the hypothesized speaker
Choose the most likely hypothesis
Likelihood ratio test: Λ(X) = p(X|H1) / p(X|H0); accept H1 if Λ(X) ≥ θ, otherwise accept H0
slide from Douglas Reynolds
Speaker ID
Log-Likelihood Ratio Score
LLR= Λ =log p(X|H1) − log p(X|H0)
Need two models
Hypothesized speaker model for H1
Alternative (background) model for H0
slide from Douglas Reynolds
How do we get H0?
Pool speech from several speakers and train a single model: a universal background model (UBM)
Can train one UBM and use it as the H0 (alternative) model for all speakers
Should be trained on speech representing the expected impostor speech
Same type of speech as speaker enrollment (modality, language, channel)
Slide adapted from Chu, Bimbot, Bonastre, Fredouille, Gravier, Magrin-Chagnolleau, Meignier, Merlin, Ortega-Garcia, PetrovskaDelacretaz, Reynolds
How to compute P(H|X)?
Gaussian Mixture Models (GMM): the traditional best model for text-independent speaker recognition
Support Vector Machines (SVM): a more recent, discriminative model
Speaker Models: GMM/HMM
Form of HMM depends on the application:
Fixed phrase ("Open sesame"): word/phrase models
Prompted phrases/passwords: phoneme models (e.g., /e/, /t/, /n/)
Text-independent (general speech): single-state HMM (GMM)
slide from Douglas Reynolds
GMMs for speaker recognition
A Gaussian mixture model (GMM) represents features as the weighted sum of multiple Gaussian distributions:
p(x | λ) = Σi wi N(x; μi, Σi)
Each Gaussian component i has a mean μi, a covariance Σi, and a weight wi.
[Figure: GMM model components in a two-dimensional feature space (Dim 1 vs. Dim 2)]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
GMM training
During training, the system learns about the data it uses to make decisions.
A set of features {x1, x2, ...} is collected from a speaker (or language or dialect) and used to train the model p(x).
[Figure: training features and the resulting model in a two-dimensional feature space]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
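As a sketch of this training step (scikit-learn's GaussianMixture is assumed as the fitting tool; the toy 2-D features stand in for real MFCC frames):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy stand-in for one speaker's MFCC features: two clusters in 2-D
feats = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(300, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(300, 2)),
])

# Fit a 2-component diagonal-covariance GMM to the speaker's features
gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
gmm.fit(feats)

# Average per-frame log-likelihood log p(x | model) on the training data
avg_ll = gmm.score(feats)
```

With real speech, `feats` would be the frame-level MFCC matrix and the component count would be much larger (see the mixture counts discussed below for UBMs and speaker models).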
Recognition Systems for Language, Dialect, and Speaker ID
In LID, DID, and SID, we train a set of target models λC, one for each language, dialect, or speaker (Model 1, Model 2, Model 3, ...), each giving p(x | λC).
[Figure: multiple target models and their components in a two-dimensional feature space]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Universal Background Model
We also train a universal background model representing all speech.
[Figure: UBM components in a two-dimensional feature space]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Hypothesis Test
Given a set of test observations Xtest = {x1, x2, ..., xK}, we perform a hypothesis test to determine whether a certain class produced it:
H0: Xtest is from the hypothesized class
H1: Xtest is not from the hypothesized class
We score Xtest against the hypothesized class's model p(x | λ1) (H0?) and against the universal background model (H1?).
[Figure: test observations compared against a target model and the UBM in a two-dimensional feature space]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
Recognition Systems: Hypothesis Test (example)
Given test observations Xtest = {x1, x2, ..., xK}: is the speech better explained by Dan's model p(x | λ1), or by the UBM (not Dan)?
[Figure: test observations compared against Dan's model and the UBM]
Nicolas Malyska, Sanjeev Mohindra, Karen Lauro, Douglas Reynolds, and Jeremy Kepner
More details on GMMs
Adapted GMMs
Instead of training the speaker model on only speaker data, adapt the UBM to that speaker.
The basic idea is to start with a single background model that represents general speech; this takes advantage of all the data.
Using target speaker training data, "tune" the general model to the specifics of the target speaker.
This "tuning" is done via unsupervised Bayesian adaptation.
MAP adaptation: the new mean of each Gaussian is a weighted mix of the UBM mean and the speaker speech, weighing the speaker data more if we have more of it:
μi ← α Ei(x) + (1 − α) μi, where α = n/(n + 16)
[Figure: target training data (x's) pulls the UBM components toward the target model]
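A minimal numpy sketch of mean-only MAP adaptation (the two-component diagonal-covariance UBM and the speaker data are toy values; the relevance factor 16 matches the α = n/(n+16) weighting above):

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_covs, ubm_weights, X, r=16.0):
    """MAP-adapt only the UBM means toward speaker data X (frames x dims).

    ubm_covs holds the diagonal of each component's covariance.
    """
    D = X.shape[1]
    log_probs = []
    for m, c, w in zip(ubm_means, ubm_covs, ubm_weights):
        diff = X - m
        ll = -0.5 * (np.sum(diff ** 2 / c, axis=1)
                     + np.sum(np.log(c)) + D * np.log(2 * np.pi))
        log_probs.append(np.log(w) + ll)
    log_probs = np.stack(log_probs, axis=1)            # (frames, components)
    log_norm = np.logaddexp.reduce(log_probs, axis=1, keepdims=True)
    post = np.exp(log_probs - log_norm)                # Pr(component i | x_t)
    n = post.sum(axis=0)                               # soft counts n_i
    Ex = (post.T @ X) / np.maximum(n[:, None], 1e-10)  # E_i(x)
    alpha = (n / (n + r))[:, None]                     # alpha_i = n_i / (n_i + 16)
    return alpha * Ex + (1 - alpha) * ubm_means

# Toy UBM with two components; the speaker's data sits near the first one
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_covs = np.ones((2, 2))
ubm_weights = np.array([0.5, 0.5])
rng = np.random.default_rng(1)
spk = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(200, 2))

adapted = map_adapt_means(ubm_means, ubm_covs, ubm_weights, spk)
```

Only the component near the speaker's data moves; the other keeps its UBM mean because its soft count n is tiny, so α ≈ 0.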
Gaussian mixture models
Features are normal MFCCs; can use more dimensions (20 + deltas)
UBM background model: 512–2048 mixtures
Speaker's GMM: 64–256 mixtures
Often combined with other classifiers in a mixture-of-experts
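A sketch of GMM-UBM scoring with scikit-learn (toy 2-D features instead of MFCCs; the component counts are scaled far down from the 512–2048 and 64–256 ranges above):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Toy 2-D features: pooled "background" speakers, plus one target speaker
background = rng.normal(loc=[0.0, 0.0], scale=2.0, size=(1000, 2))
target_train = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(300, 2))

ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(background)
spk = GaussianMixture(n_components=4, covariance_type="diag",
                      random_state=0).fit(target_train)

def llr(X):
    """Average per-frame log-likelihood ratio: log p(X|spk) - log p(X|UBM)."""
    return spk.score(X) - ubm.score(X)

target_test = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(100, 2))
impostor_test = rng.normal(loc=[-2.0, -1.0], scale=0.5, size=(100, 2))
```

True-trial speech scores well above zero and impostor speech well below, which is exactly the quantity thresholded in the likelihood ratio test.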
SVM
Train a one-versus-all discriminative classifier
Various kernels
Combine with GMM
Other features
Prosody
Phone sequences
Language Model features
Speaker information of word bigrams
Doddington (2001)
A bigram is just the occurrence of two tokens in a sequence
Word bigrams can be very informative about speaker identity
Evaluation Metric
Trial: Are a pair of audio samples spoken by the
same person?
Two types of errors:
False reject = Miss: incorrectly reject a true trial
Type-I error
False accept: incorrectly accept false trial
Type-II error
Performance is trade-off between these two errors
Controlled by adjustment of the decision threshold
slide from Douglas Reynolds
ROC and DET curves
P(false reject) vs. P(false accept) shows system performance
slide from Douglas Reynolds
DET curve
Application operating point depends on relative costs of the two errors
slide from Douglas Reynolds
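The DET trade-off and the equal error rate can be computed directly from trial scores; a numpy sketch on synthetic scores (the two normal distributions are illustrative only):

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    """Sweep the decision threshold over all observed scores and return
    the miss and false-alarm probabilities at each threshold."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    p_fa = np.array([(impostor_scores >= t).mean() for t in thresholds])
    return p_miss, p_fa

def equal_error_rate(target_scores, impostor_scores):
    """Operating point where miss and false-alarm rates are (nearly) equal."""
    p_miss, p_fa = det_points(target_scores, impostor_scores)
    i = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[i] + p_fa[i]) / 2

rng = np.random.default_rng(3)
targets = rng.normal(2.0, 1.0, size=2000)    # true-trial scores
impostors = rng.normal(0.0, 1.0, size=2000)  # false-trial scores
eer = equal_error_rate(targets, impostors)
```

Plotting p_miss against p_fa on normal-deviate axes gives the DET curve; the EER is just one point on it, and an application may operate elsewhere on the curve.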
Evaluation Design
Data Selection Factors
Evaluation
tasks
•
Performance numbers are only meaningful when evaluation
Performance
depend on evaluation conditions
conditions numbers
are known
Speech quality
–
–
–
Channel and microphone characteristics
Ambient noise level and type
Variability between enrollment and
verification speech
Speech modality
–
–
Fixed/prompted/user-selected phrases
Free text
Speech duration
–
Duration and number of sessions of
enrollment and verification speech
Speaker population
–
–
Size and composition
Experience
The evaluation data and design should match the
target application domain of interest
slide from
Douglas Laboratory
Reynolds
MIT Lincoln
Rough historical trends in performance
slide from Douglas Reynolds
Milestones in the NIST SRE Program
1992 – DARPA: limited speaker id evaluation
1996 – First SRE in current series
2000 – AHUMADA Spanish data, first non-English speech
2001 – Cellular data
2001 – ASR transcripts provided
2002 – FBI “forensic” database
2005 – Multiple languages with bilingual speakers
2005 – Room mic recordings, cross-channel trials
2008 – Interview data
2010 – New decision cost function: lower FA rate region
2010 – High and low vocal effort, aging
2011 – Broad range of conditions, including noise and reverb
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
Metrics
Equal Error Rate
Easy to understand
Not the operating point of interest
FA rate at fixed miss rate
E.g. 10%
May be viewed as cost of listening to false
alarms
Decision Cost Function
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
Decision Cost Function CDet
Weighted sum of miss and false alarm error probabilities:
CDet = CMiss × PMiss|Target × PTarget + CFalseAlarm × PFalseAlarm|NonTarget × (1 − PTarget)
Parameters are the relative costs of detection errors, CMiss and CFalseAlarm, and the a priori probability of the specified target speaker, PTarget:

              '96–'08    2010
CMiss           10         1
CFalseAlarm      1         1
PTarget          0.01      0.001
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
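The cost function is a one-liner; a sketch with the '96–'08 parameters as defaults and the 2010 parameters swapped in for comparison:

```python
def c_det(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """NIST SRE decision cost CDet (defaults: the 1996-2008 parameters)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)

# Same error rates, two parameter sets: the 2010 set (CMiss=1, PTarget=0.001)
# weights the low-false-alarm region much more heavily
old_cost = c_det(0.10, 0.01)
new_cost = c_det(0.10, 0.01, c_miss=1.0, c_fa=1.0, p_target=0.001)
```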
Accuracies
From Alvin Martin’s 2012 talk on the NIST SR Evaluations
How good are humans?
Bruce E. Koenig. 1986. Spectrographic voice identification: A forensic survey. J. Acoust. Soc. Am, 79(6)
Survey of 2000 voice IDs made by trained FBI employees
select similarly pronounced words
use spectrograms (comparing formants, pitch, timing)
listen back and forth
Evaluated based on "interviews and other evidence in
the investigation" and legal conclusions
No decision: 65.2% (1304)
Non-match: 18.8% (378)
Match: 15.9% (318)
FR = 0.53% (2), FA = 0.31% (1)
Speaker diarization
Conversational telephone speech: 2 speakers
Broadcast news: many speakers, although often in dialogue (interviews) or in sequence (broadcast segments); may include different structural regions such as commercials, different acoustic events such as music or noise, and different speakers
Meeting recordings: many speakers, lots of overlap and disfluencies
[Figure: example of audio diarization on broadcast news, from Fig. 1 of Tranter and Reynolds, "An Overview of Automatic Speaker Diarization Systems," IEEE Trans. Audio, Speech, and Language Processing, vol. 14, no. 5, September 2006]
Tranter and Reynolds 2006
Speaker diarization
Tranter and Reynolds 2006
Step 1: Speech Activity Detection
Meetings or broadcast:
Use supervised GMMs
two models: speech/non-speech
or could have extra models for music, etc.
Then do Viterbi segmentation, possibly with
minimum length constraints or
smoothing rules
Telephone
Simple energy/spectrum speech activity detection
State of the art:
Broadcast: 1% miss, 1-2% false alarm
Meeting: 2% miss, 2-3% false alarm
Tranter and Reynolds 2006
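For the telephone case, a minimal energy-based speech activity detector might look like this (frame sizes assume 16 kHz audio; the -30 dB threshold is an illustrative choice, not a tuned value):

```python
import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-30.0):
    """Mark a frame as speech when its log energy is within threshold_db
    of the loudest frame in the recording."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    log_e = 10.0 * np.log10(energies + 1e-12)
    return log_e > (log_e.max() + threshold_db)

rng = np.random.default_rng(4)
silence = rng.normal(0.0, 0.001, size=8000)   # 0.5 s of near-silence
speech = rng.normal(0.0, 0.3, size=8000)      # 0.5 s of loud signal
decisions = energy_vad(np.concatenate([silence, speech, silence]))
```

The supervised GMM detectors used for meetings and broadcast replace this single energy feature with trained speech/non-speech models plus Viterbi smoothing.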
Step 2: Change Detection
1. Look at adjacent windows of data
2. Calculate distance between them
3. Decide whether windows come from same source
Two common methods:
1. To look for change points within a window, use a likelihood ratio test to see whether the data is better modeled by one distribution or two. If two, insert a change point and start a new window there; if one, expand the window and check again.
2. Represent each window by a Gaussian, compare neighboring windows with the KL distance, find peaks in the distance function, and threshold.
Tranter and Reynolds 2006
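The second (KL-distance) method can be sketched with one diagonal Gaussian per window (window size and the toy feature stream are illustrative):

```python
import numpy as np

def sym_kl(win_a, win_b):
    """Symmetric KL divergence between two windows of feature frames,
    each modeled by a single diagonal-covariance Gaussian."""
    m1, v1 = win_a.mean(axis=0), win_a.var(axis=0) + 1e-8
    m2, v2 = win_b.mean(axis=0), win_b.var(axis=0) + 1e-8
    kl12 = 0.5 * np.sum(v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + np.log(v2 / v1))
    kl21 = 0.5 * np.sum(v2 / v1 + (m1 - m2) ** 2 / v1 - 1 + np.log(v1 / v2))
    return kl12 + kl21

rng = np.random.default_rng(5)
# Toy feature stream with a speaker change at frame 200
stream = np.vstack([
    rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
    rng.normal([3.0, 3.0], 1.0, size=(200, 2)),
])
win = 50
# Distance between adjacent windows at each candidate boundary
dists = np.array([
    sym_kl(stream[t - win:t], stream[t:t + win])
    for t in range(win, len(stream) - win)
])
change_point = win + int(np.argmax(dists))
```

The distance function peaks where the two adjacent windows straddle the change, so peak-picking plus a threshold recovers the boundary.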
Step 3: Gender Classification
Supervised GMMs
If doing Broadcast news, also do bandwidth
classification (studio wideband speech versus
narrowband telephone speech)
Tranter and Reynolds 2006
Step 4: Clustering
Hierarchical agglomerative clustering
1. initialize leaf clusters of tree with speech segments;
2. compute pair-wise distances between each cluster;
3. merge closest clusters;
4. update distances of remaining clusters to new cluster;
5. iterate steps 2–4 until stopping criterion is met
Tranter and Reynolds 2006
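The five steps can be sketched generically; the scalar "segments" and absolute-difference distance below are stand-ins for real segment models and a BIC-style distance:

```python
def agglomerative_cluster(segments, distance, stop_threshold):
    """Greedy bottom-up clustering: repeatedly merge the closest pair of
    clusters until the closest remaining pair exceeds stop_threshold."""
    clusters = [[s] for s in segments]                   # step 1: leaf clusters
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):                   # step 2: pairwise distances
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > stop_threshold:                           # step 5: stopping criterion
            break
        clusters[i] = clusters[i] + clusters[j]          # step 3: merge closest
        del clusters[j]                                  # step 4: distances are
    return clusters                                      # recomputed next pass

# Toy "segments": scalar summaries forming two well-separated speaker groups
result = agglomerative_cluster([0.0, 0.1, 0.2, 5.0, 5.1],
                               distance=lambda a, b: abs(a - b),
                               stop_threshold=1.0)
```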
Step 5: Resegmentation
Use final clusters and non-speech models
To resegment data via Viterbi decoding
Goal:
refine original segmentation
fix short segments that may have been removed
Tranter and Reynolds 2006
TDOA features
For meetings, with multiple-microphones
Time-Delay-of-Arrival (TDOA) features
correlate signals from mikes and figure out time shift
used to sync up multiple microphones
and as a feature for speaker localization
assume the speaker doesn’t move, so they are near the same
microphone
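The correlate-and-find-the-shift step can be sketched with numpy (lags are in samples; a real meeting system would typically add GCC-PHAT weighting):

```python
import numpy as np

def tdoa_samples(ref, delayed):
    """Estimate how many samples 'delayed' lags 'ref' by locating the
    peak of their cross-correlation."""
    corr = np.correlate(delayed, ref, mode="full")
    return int(np.argmax(corr)) - (len(ref) - 1)

rng = np.random.default_rng(6)
ref = rng.normal(size=2000)
# The same signal arriving 37 samples later at a second microphone
delayed = np.concatenate([np.zeros(37), ref])[:2000]
lag = tdoa_samples(ref, delayed)
```

Dividing the lag by the sample rate gives the delay in seconds, which is the TDOA feature used for localization.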
Evaluation
Systems give start-stop times of speech segments with
speaker labels
nonscoring “collar” of 250 ms on either side
DER (Diarization Error Rate)
missed speech (% of speech in the ground-truth but not in
the hypothesis)
false alarm speech (% of speech in the hypothesis but not in
the ground-truth)
speaker error (% of speech assigned to the wrong speaker)
Recent mean DER for Multiple Distant Mikes (MDM): 8-10%
Recent mean DER for Single Distant Mike (SDM): 12-18%
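The three DER components combine as a fraction of ground-truth speech time; a frame-level sketch, which skips the optimal speaker-mapping step a real scorer performs:

```python
def diarization_error_rate(ref, hyp):
    """Frame-level DER. ref and hyp give one speaker label per frame,
    with None meaning non-speech. Assumes hypothesis speaker labels
    have already been mapped onto reference speakers."""
    total_speech = sum(1 for r in ref if r is not None)
    miss = sum(1 for r, h in zip(ref, hyp) if r is not None and h is None)
    fa = sum(1 for r, h in zip(ref, hyp) if r is None and h is not None)
    spk_err = sum(1 for r, h in zip(ref, hyp)
                  if r is not None and h is not None and r != h)
    return (miss + fa + spk_err) / total_speech

# Toy 8-frame example: one miss, one false alarm, one speaker error
ref = ["A", "A", "A", "B", "B", None, None, "B"]
hyp = ["A", "A", "B", "B", None, None, "A", "B"]
der = diarization_error_rate(ref, hyp)
```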
Summary: Speaker Recognition Tasks
Identification: Whose voice is this?
Verification/Authentication/Detection: Is this Bob's voice?
Segmentation and Clustering (Diarization): Where are speaker changes? (Speaker A vs. Speaker B) Which segments are from the same speaker?
slide from Douglas Reynolds