
APSIPA Distinguished Lecture Plan 2012-2013
Speaker Recognition Systems:
Paradigms and Challenges
Thomas Fang Zheng
Joint work with Linlin Wang and Xiaojun Wu
<Date>, <Venue>
Center for Speech and Language Technologies, Tsinghua University
Asia-Pacific Signal and Information Processing Association
About APSIPA
- Asia-Pacific Signal & Information Processing Association
- An emerging association to promote a broad spectrum of research and education activities in signal and information processing (SIP)
- Mission: a non-profit organization with the following objectives:
  - Providing education, research, and development exchange platforms for both academia and industry;
  - Organizing common-interest activities for researchers and practitioners;
  - Facilitating collaboration with region-specific focuses and promoting leadership for worldwide events;
  - Disseminating research results and educational material via publications, presentations, and electronic media;
  - Offering personal and professional career opportunities with development information and networking.
- Established on October 5, 2009; officially registered in Hong Kong
- APSIPA ASC (Annual Summit and Conference), held annually since 2009
- APSIPA Transactions on Signal & Information Processing
- APSIPA Distinguished Lecture Program, started in January 2012
- http://www.apsipa.org
Outline
- Introduction
- Creation of Time-varying Voiceprint Database
- The Discrimination-emphasized Mel-frequency-warping Method
- Experimental Results
- Conclusions & Future Work
Biometric Recognition
Technologies for measuring and analyzing a person's physiological or behavioral characteristics. These can be used to verify or identify a person.
The term "biometrics" is derived from the Greek words bio (life) and metric (to measure).
Examples of Biometrics
- Face
- Fingerprint
- Palmprint
- Hand Geometry
- Iris
- Retina Scan
- DNA
- Signatures
- Gait
- Keystroke
- Voiceprint
Rich Information Contained in Speech
- What was spoken? -> Speech Recognition
- Who spoke? -> Speaker Recognition
- What language was spoken? -> Language Recognition
- Where is he/she from? -> Accent Recognition
- Male or Female? -> Gender Recognition
- Positive? Negative? Happy? Sad? -> Emotion Recognition
Speaker Recognition / Voiceprint Recognition
- Speaker recognition (or voiceprint recognition) is the process of automatically identifying or verifying the identity of a person from his/her voice, using the characteristic vocal information contained in speech. It enables voice-based access control for various services. [Kunzel 94][Furui 97]
- Various applications:
  - Access control (e.g., security control for confidential information, remote access to computers, information and reservation services);
  - Transaction authentication (e.g., telephone banking, telephone shopping);
  - Security and forensics (e.g., public security, criminal verification);
  - Rich transcription for conference meetings (e.g., "who spoke when" and "who spoke what" speaker diarization);
  - etc.
Speaker Recognition Categories
- Speaker Identification
  Determining which identity in a specified speaker set is speaking during a given speech segment.
  Closed-Set / Open-Set
- Speaker Verification
  Determining whether a claimed identity is speaking during a speech segment. It is a binary decision task.
- Speaker Detection
  Determining whether a specified target speaker is speaking during a given speech segment.
- Speaker Tracking (Speaker Diarization = Who Spoke When)
  Performing speaker detection as a function of time, giving the timing index of the specified speaker.
Performance Evaluation
(for verification and open-set identification)
- Detection Error Trade-off (DET) Curve
  A plot of error rates for binary classification systems: false rejection rate (FRR) vs. false acceptance rate (FAR).
- Equal Error Rate (EER)
  The error rate at the point on the DET curve where FAR and FRR are equal.
- Minimum Detection Cost Function (MinDCF)
  C_det = C_Miss * P_Miss * P_Target + C_FalseAlarm * P_FalseAlarm * (1 - P_Target)
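Not part of the original slides: a minimal Python sketch of how the EER and a detection cost of this form can be computed from trial scores. The function names are ad hoc, and the cost weights and target prior in min_dcf are illustrative placeholders rather than the values used in the lecture's evaluations.

```python
import numpy as np

def det_points(target_scores, nontarget_scores):
    """Sweep a decision threshold over all observed scores; return (FRR, FAR) arrays."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    frr = np.array([(target_scores < t).mean() for t in thresholds])      # false rejections
    far = np.array([(nontarget_scores >= t).mean() for t in thresholds])  # false acceptances
    return frr, far

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the point on the DET curve where FAR and FRR coincide."""
    frr, far = det_points(target_scores, nontarget_scores)
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2.0

def min_dcf(target_scores, nontarget_scores, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Minimum over thresholds of
    C_det = C_Miss * P_Miss * P_Target + C_FalseAlarm * P_FalseAlarm * (1 - P_Target).
    Cost weights and prior here are illustrative assumptions."""
    frr, far = det_points(target_scores, nontarget_scores)
    return np.min(c_miss * frr * p_target + c_fa * far * (1.0 - p_target))

# Toy usage with synthetic scores (illustration only):
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 1.0, 1000)     # target-trial scores
impostor = rng.normal(-1.0, 1.0, 1000)   # non-target-trial scores
print(equal_error_rate(genuine, impostor), min_dcf(genuine, impostor))
```

Sweeping the threshold over all observed scores is the simplest way to trace the DET curve; interpolating between the two closest points gives a slightly more precise EER.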
Open Issues for Speaker Recognition Research [Furui 1997]
1. How can human beings correctly recognize speakers?
2. Is it useful to study the mechanism of speaker recognition by human beings?
3. Is it useful to study the physiological mechanism of speech production to get new ideas for speaker recognition?
4. What feature parameters are appropriate for speaker recognition?
5. How can we fully exploit the clearly evident encoding of identity in prosody and other supra-segmental features of speech?
6. Is there any feature that can separate speakers whose voices sound identical, such as twins or imitators?
7. How do we deal with long-term variability in people's voices (ageing)?
8. How do we deal with short-term alteration due to illness, emotion, fatigue, ...?
9. What are the conditions that speaker recognition must satisfy to be practical?
10. What about combining speech and speaker recognition?

Furui, S., "Recent Advances in Speaker Recognition," Pattern Recognition Letters 18 (1997) 859-872.
Performance Factors for Speaker Recognition
Factors affecting speaker recognition system performance:
- The quality of the speech signal
- The length of the training speech signal
- The length of the testing speech signal
- The size of the population tested by the system
- The phonetic content of the speech signal
Key Issues for Robust Speaker Recognition
- Cross Channel
- Multiple Speakers
- Background Noise
- Emotions
- Short Utterance
- Time-Varying (or Ageing)
Time-Varying (or Ageing) Issue
In typical application scenarios, training and testing are usually separated by some period of time (a TIME GAP), which poses a possible threat to speaker recognition systems.
"Ever-newer waters flow on those who step into the same rivers."
(Heraclitus)
Open Questions
- "Does the voice of an adult change significantly with time? If so, how?" [Kersta 1962]
- "How to deal with the long-term variability in people's voice? Whether there was any systematic long-term variation that helped update speaker models to cope with the gradual changes in people's voices?" [Furui 1997]
- "Voice changes over time, either in the short-term (at different times of day), the medium-term (times of the year), or in the long-term (with age)." [Bonastre et al. 2003]
Observations
- Performance degrades in the presence of time intervals:
  - The longer the separation between the training and testing recordings, the worse the performance. [Soong et al. 1985]
  - A significant loss in accuracy (4~5% in EER) between two sessions separated by 3 months was reported [Kato & Shimizu 2003], and ageing was considered to be the cause [Hebert 2008].
- Few researchers have pinned down the exact reasons behind this time-varying phenomenon.
More enrollment data -- a solution?
- Using training data with a larger time span [Markel 1979]
  - Performance can be improved.
  - However, enrollment becomes quite time-consuming!
  - In some situations, it is impractical to obtain such data!
- Augmenting previous enrollment data with accepted testing/recognition speech segments and retraining the speaker model [Beigi 2009, Beigi 2010]
  - Performance can be improved.
  - However, the initial training data must be kept for later use (storage-consuming)!
Ageing-dependent decision boundary -- a solution?
- Using an ageing-dependent decision boundary in the score domain [Kelly 2011, Kelly 2012]
  - Performance can be improved.
  - However, how can the time lapse be determined in practice?
Model-updating (adaptation) -- a solution?
- A simple and straightforward way [Lamel 2000, Beigi 2009, Beigi 2010]: update speaker models from time to time.
  - It is effective in maintaining representativeness.
  - However, it is costly, user-unfriendly, and sometimes perhaps unrealistic.
- And the choice of features matters.
Efforts in the frequency domain …
- The most essential way to stabilize performance is to extract acoustic features that are speaker-specific and, furthermore, stable across sessions.
  - This has long been more of a dream than a reality!
- A more practical route is to incorporate such findings into existing techniques …
  - NUFCC [Lu & Dang 2007]: assign frequency bands different resolutions according to their discrimination sensitivity for speaker-specific information.
The idea of mel-frequency warping!
- Emphasize frequency bands that are more sensitive to speaker-specific information, yet not so sensitive to time-related, session-specific information.
- Identify frequency bands that show high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for session-specific information.
- Once these frequency bands are identified, more features can be extracted within them by means of frequency warping.
- This is the Discrimination-emphasized Mel-frequency-warping method.
Outline
- Introduction
- Creation of Time-varying Voiceprint Database
- The Discrimination-emphasized Mel-frequency-warping Method
- Experimental Results
- Conclusions & Future Work
MARP Corpus
- A proper longitudinal database is necessary.
  - Time-related variability should be the only focus.
- The MARP corpus [Lawson 2009] has been the only one published so far, although it contains additional sources of variability.
- The MARP corpus:
  - 32 participants, 672 sessions from June 2005 to March 2008
  - 10 minutes of free-flowing conversation in each session
  - "While the impact on speaker recognition accuracy between any two sessions is considerable, the long-term trend is statistically quite small."
  - "The detrimental impact is clearly not a function of ageing or of the voice changing within this timeframe."
In free-flowing conversations, speech content is not fixed, and a speaker's emotion, speaking style, or engagement can easily be influenced by his/her partner.
Hence, the creation of a voiceprint database that focuses specifically on the time-varying effect in speaker recognition is imperative for both research and practical applications.
Database Design Principles
- The time-varying effect is the only focus; therefore, other factors should be kept as constant as possible throughout all recording sessions:
  - recording equipment, software, conditions, environment, and so on.
- Two major factors were carefully considered in the database design:
  - prompt text design, and
  - time interval design.
Fixed Prompt Texts
- Speakers were asked to read fixed prompt texts rather than hold free-style conversations.
- The prompt texts remained unchanged throughout all recording sessions,
  - to avoid, or at least reduce, the impact of speech content on speaker recognition accuracy.
- The prompt texts take the form of sentences and isolated words.
- 100 Chinese sentences and 10 isolated Chinese words
- The length of each sentence ranges from 8 to 30 Chinese characters, with an average of 15.
- Each isolated Chinese word contains 2 to 5 Chinese characters and was read five times in each session.
- Of the 10 isolated words, 5 were unchanged throughout all sessions, just like the sentences, while the other 5 changed from session to session and are reserved for future research on other topics.
Table 1. Acoustic coverage of prompt texts

            Number covered in prompt texts   Total number   Percentage (%)
Initials    23                               23             100
Finals      38                               38             100
di-IFs      1,183                            1,523          78
Gradient Time Intervals
- Gradient time intervals were used:
  - there is no precedent to follow for time-interval design;
  - it would be costly, and perhaps unnecessary, to record more than 10 sessions at a fixed-length interval just to observe a possible trend.
- Initial sessions use shorter time intervals, while later sessions use longer and longer intervals.
  - The impacts of different time intervals can thus be easily analyzed.
- 16 sessions from January 2010 to 2012.
- Five different time intervals are used: one week, one month, two months, four months, and half a year, as illustrated in Figure 1 below.
- The time intervals were designed so that no recordings fall in the summer or winter vacations.
- In actual recording, it is unrealistic to have all speakers record on exactly one specific day, so each session day is relaxed to a session interval.

Figure 1. Illustration of different time intervals and session days (sessions vs. time)
Speakers
- 60 freshman students: 30 male and 30 female.
- Born between 1989 and 1993, with the majority born in 1990.
- From various departments, such as computer science, biology, English, humanities, and journalism.
- All of them speak standard Chinese well.
Recording conditions
- An ordinary laboratory room was used for recording:
  - no burst noise, and only low-level environmental noise.
- Speakers were asked to read the prompt texts at a normal speaking rate; the volume was controlled by the recording software.
  - Most speakers could complete a session smoothly in about 25 minutes.
- Speech signals are digitized simultaneously at 8 kHz and 16 kHz sampling rates with 16-bit precision.
- 10 recording sessions have been completed so far.
Database evaluation -- a first and quick look
- Experimental setup
  - 1024-mixture GMM-UBM system with 32-dimensional MFCCs
- Experimental results
  - The system performs best when training and testing utterances are taken from the same session.
  - However, performance becomes worse and worse as the recording-date difference between training and testing grows.

Figure 2. EER curves when using different sessions for model training
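For orientation only, here is a rough Python sketch of a GMM-based baseline of the kind described above. It uses scikit-learn and librosa, fits independent per-speaker GMMs instead of MAP-adapting them from the UBM as a full GMM-UBM system does, and every parameter value and file-handling detail is an assumption rather than the lecture's actual configuration.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path, sr=16000, n_mfcc=16):
    """Frame-level MFCCs, shape (n_frames, n_mfcc)."""
    y, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Universal background model trained on pooled background speech.
# 1024 mixtures as in the slide's setup; diagonal covariances are an assumption.
ubm = GaussianMixture(n_components=1024, covariance_type='diag', max_iter=50)
# ubm.fit(np.vstack([mfcc_features(p) for p in background_wavs]))  # background_wavs: assumed list of paths

def enroll(train_wavs):
    """Fit a speaker model on enrollment recordings (independent fit, not MAP adaptation)."""
    gmm = GaussianMixture(n_components=1024, covariance_type='diag', max_iter=50)
    gmm.fit(np.vstack([mfcc_features(p) for p in train_wavs]))
    return gmm

def verify_score(speaker_gmm, test_wav):
    """Average per-frame log-likelihood ratio between the speaker model and the UBM."""
    x = mfcc_features(test_wav)
    return speaker_gmm.score(x) - ubm.score(x)
```

Training on one session and scoring against every other session with a system of this kind is what produces the EER-versus-time-gap trend shown in Figure 2.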
Outline
- Introduction
- Creation of Time-varying Voiceprint Database
- The Discrimination-emphasized Mel-frequency-warping Method
- Experimental Results
- Conclusions & Future Work
How to find IMPORTANT frequency bands?
- The proposed solution is to highlight, during feature extraction, the frequency bands that show high discrimination sensitivity for speaker-specific information but low discrimination sensitivity for session-specific information.
- How to determine the discrimination sensitivity of each frequency band?
  - The F-ratio serves as the criterion to produce discrimination scores.
- How to perform frequency warping to highlight the target frequency bands?
  - Frequency warping on the basis of the mel scale.
F-ratio [Wolf 1972]
- The ratio of the between-group variance to the within-group variance.
- A higher F-ratio value means a better feature selection for the target grouping.
  - That is to say, a feature selection with a higher F-ratio possesses higher discrimination sensitivity with respect to the target grouping.
F-ratio in time-varying speaker recognition tasks
- There are two kinds of grouping: by speakers within each session, and by sessions for each speaker.
- The whole frequency range is divided uniformly into K frequency bands.
  - Linear-frequency-scale triangular filters are used to process the power spectrum of the utterances (a sketch follows below).
- Two F-ratio values are obtained for each frequency band.
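As a concrete illustration of this band division (not code from the lecture), the following sketch builds K triangular filters whose centers are spaced uniformly on the linear frequency (Hz) scale; the shapes and parameter names are assumptions.

```python
import numpy as np

def linear_triangular_filters(n_bands, n_fft, sr):
    """K triangular filters with centers spaced uniformly on the linear (Hz) scale."""
    n_bins = n_fft // 2 + 1
    edges = np.linspace(0, sr / 2, n_bands + 2)      # band edges in Hz
    bin_freqs = np.linspace(0, sr / 2, n_bins)       # frequency of each FFT bin
    fb = np.zeros((n_bands, n_bins))
    for k in range(n_bands):
        left, center, right = edges[k], edges[k + 1], edges[k + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        fb[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    # Applying fb to a power spectrum yields the K band energies used for the F-ratios.
    return fb
```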
For frequency band k, let M be the number of speakers, S the number of sessions, and x_{i,s}^{k,j} (j = 1, ..., N_{i,s}) the band-k feature values of speaker i in session s, with per-group mean \mu_{i,s}, session mean \mu_s, and speaker mean \mu_i. Then

F\text{-}ratio\text{-}spk_s^k = \frac{\sum_{i=1}^{M} (\mu_{i,s} - \mu_s)^2}{\sum_{i=1}^{M} \frac{1}{N_{i,s}} \sum_{j=1}^{N_{i,s}} \left( x_{i,s}^{k,j} - \mu_{i,s} \right)^2},
\qquad
F\text{-}ratio\text{-}spk^k = \frac{1}{S} \sum_{s=1}^{S} F\text{-}ratio\text{-}spk_s^k,

F\text{-}ratio\text{-}ssn_i^k = \frac{\sum_{s=1}^{S} (\mu_{i,s} - \mu_i)^2}{\sum_{s=1}^{S} \frac{1}{N_{i,s}} \sum_{j=1}^{N_{i,s}} \left( x_{i,s}^{k,j} - \mu_{i,s} \right)^2},
\qquad
F\text{-}ratio\text{-}ssn^k = \frac{1}{M} \sum_{i=1}^{M} F\text{-}ratio\text{-}ssn_i^k.
Figure 3. An illustration of two kinds of grouping
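To make the two groupings concrete, here is a small illustrative Python sketch (not from the lecture) that computes the per-band speaker and session F-ratios from per-speaker, per-session band features; the data layout and function name are assumptions.

```python
import numpy as np

def f_ratios(feats):
    """
    feats[s][i] is an array of shape (N_{i,s}, K): the band-k feature values
    for speaker i in session s. Returns two length-K arrays:
    (F-ratio-spk per band, F-ratio-ssn per band).
    """
    S, M = len(feats), len(feats[0])
    mu = np.array([[feats[s][i].mean(axis=0) for i in range(M)] for s in range(S)])      # (S, M, K)
    within = np.array([[feats[s][i].var(axis=0) for i in range(M)] for s in range(S)])   # (S, M, K)

    # Grouping by speakers within each session.
    mu_s = mu.mean(axis=1, keepdims=True)                        # session means, (S, 1, K)
    spk = ((mu - mu_s) ** 2).sum(axis=1) / within.sum(axis=1)    # (S, K)
    f_ratio_spk = spk.mean(axis=0)                               # average over sessions, (K,)

    # Grouping by sessions for each speaker.
    mu_i = mu.mean(axis=0, keepdims=True)                        # speaker means, (1, M, K)
    ssn = ((mu - mu_i) ** 2).sum(axis=0) / within.sum(axis=0)    # (M, K)
    f_ratio_ssn = ssn.mean(axis=0)                               # average over speakers, (K,)

    return f_ratio_spk, f_ratio_ssn
```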
For each frequency band k, a discrimination score is defined as:

discrim\_score^{(k)} = \frac{F\text{-}ratio\text{-}spk^{(k)}}{F\text{-}ratio\text{-}ssn^{(k)}}.    (1)

Target frequency bands with higher discrimination scores should be assigned a proper warping factor: neither too small, which would fail to emphasize them, nor too big, which would increase their frequency resolution excessively.
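Continuing the illustrative sketch above (same assumed f_ratios helper and numpy import), the discrimination scores of Eq. (1) can be computed per band and thresholded to pick the bands to emphasize; the median threshold is an arbitrary placeholder.

```python
f_spk, f_ssn = f_ratios(feats)           # per-band F-ratio-spk and F-ratio-ssn
discrim_score = f_spk / f_ssn            # Eq. (1): one discrimination score per band

threshold = np.median(discrim_score)     # placeholder threshold; the lecture's value is not given
emphasized_bands = np.where(discrim_score > threshold)[0]
```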
How to EMPHASIZE? Mel-frequency warping (MFW)!
Warping strategies:
- Uniform warping of the target frequency bands whose discrimination scores are above a threshold.
- Non-uniform warping of the whole frequency range according to the discrimination scores.

Figure 4. The relationship between Hz, the Mel scale, and the MFW scale
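The sketch below is one possible reading of the uniform warping strategy (not the lecture's exact procedure): FFT bins falling inside the emphasized bands are stretched by a constant warping factor before filter placement, so that more filters, and hence more feature resolution, land on those bands. All names and parameter values are assumptions.

```python
import numpy as np

def warped_scale(n_fft_bins, emphasized_bands, n_bands=32, warp_factor=3.0):
    """
    Map each FFT bin to a warped coordinate: bins falling inside an emphasized
    band advance warp_factor times faster than others, so filters placed
    uniformly on the warped axis concentrate on those regions.
    """
    band_of_bin = (np.arange(n_fft_bins) * n_bands) // n_fft_bins        # uniform band index per bin
    step = np.where(np.isin(band_of_bin, emphasized_bands), warp_factor, 1.0)
    coord = np.concatenate([[0.0], np.cumsum(step)])                      # monotone warped coordinate
    return coord / coord[-1]                                              # normalized to [0, 1]

# Placing triangular filters uniformly on this warped axis emphasizes the selected
# bands while leaving the rest of the cepstral pipeline unchanged.
```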
Figure 5. A comparison of MFCC and WMFCC extraction procedures
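Figure 5 itself is not reproduced in this transcript. As a rough stand-in (an assumption about the figure's content based on the surrounding slides), the sketch below shows the shared cepstral pipeline in which MFCC and WMFCC differ only in the filterbank that is applied; frame handling and parameter names are illustrative.

```python
import numpy as np
from scipy.fft import dct

def cepstral_features(frames, filterbank, n_ceps=16):
    """
    frames: (n_frames, frame_len) windowed signal frames.
    filterbank: (n_filters, frame_len // 2 + 1) triangular filters.
    A standard mel filterbank yields MFCCs; a discrimination-warped
    filterbank (as sketched above) yields WMFCCs.
    """
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # power spectrum per frame
    energies = spectrum @ filterbank.T                        # filterbank energies
    log_energies = np.log(np.maximum(energies, 1e-10))        # floor avoids log(0)
    return dct(log_energies, type=2, norm='ortho', axis=1)[:, :n_ceps]
```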
Outline
- Introduction
- Creation of Time-varying Voiceprint Database
- The Discrimination-emphasized Mel-frequency-warping Method
- Experimental Results
- Conclusions & Future Work
The discrimination for different bands …

Table 3. Performance comparison of WMFCC with different warping factors in average EER

Warping factor   1       2      3      4      5
EER (%)          10.06   8.69   8.14   8.22   8.36

Figure 7. Discrimination scores of frequency bands
Comparison

Figure 7. Performance comparison between MFCC and WMFCC in EER (EER per training session, 1st to 10th, and average; MFCC vs. WMFCC)

Table 3. Performance comparison between MFCC and WMFCC in degradation degree

                     2nd-session EER (%)   Average EER (%)   Degradation Degree (%)   Standard Deviation
MFCC                 6.45                  10.06             55.97                    1.83
WMFCC                5.38                  8.14              51.30                    1.32
Reduction Rate (%)   16.6                  19.1              8.9                      27.9
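A quick arithmetic check of the reduction rates (a worked example added here, not from the slides; the definition of degradation degree is inferred): the average-EER reduction is (10.06 - 8.14) / 10.06 ≈ 19.1% and the 2nd-session-EER reduction is (6.45 - 5.38) / 6.45 ≈ 16.6%, matching the table; the degradation degree appears to be the relative increase of the average EER over the 2nd-session EER, i.e. (10.06 - 6.45) / 6.45 ≈ 55.97% for MFCC and (8.14 - 5.38) / 5.38 ≈ 51.30% for WMFCC.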
Outline
- Introduction
- Creation of Time-varying Voiceprint Database
- The Discrimination-emphasized Mel-frequency-warping Method
- Experimental Results
- Conclusions & Future Work
- A Discrimination-emphasized Mel-frequency-warping method is proposed for time-varying speaker recognition.
- Experimental results on the time-varying voiceprint database show that this method not only improves speaker recognition performance, reducing the average EER by 19.1%, but also alleviates the performance degradation brought by time variation, with a reduction of 8.9%. [WANG 2011, APSIPA ASC 2011 Excellent Student Paper Award]
- Future work
  - Further experiments on other databases are needed to test data dependency.
  - Whether the discrimination-emphasized idea can be applied to other speech features and, further, to speaker modeling techniques requires more investigation and experimentation.
Thanks!
http://cslt.riit.tsinghua.edu.cn
http://www.apsipa.org
[email protected]
Update ...
Telephone banking application
- In 2009, d-Ear Technologies (得意音通) and Tsinghua University jointly undertook the China Construction Bank project "95533 Telephone Banking Voiceprint Authentication System". The project passed acceptance in 2010, and in November 2011 CCB confirmed that the system "has been running normally for a full year". China Construction Bank thus became the first bank in China's financial sector to apply voiceprint authentication.
- China Merchants Bank (招商银行) -- next!